How to Download Sports Data
So you really want to dominate your fantasy league this year. You’ve been looking through the rankings provided by the major fantasy sports sites, but they don’t seem to fit your league’s settings. You’ve been brushing up on your Excel skills, and if only you could get your hands on some data you could do your own analysis and rank players based on your own criteria. Or maybe you’re trying to decide who to select in your weekly pick ’em pool and you want to base your selection on some objective data. The Import Data function in Excel doesn’t work well with some of the sites you’ve been trying to pull data from, so you’re stuck. Sound familiar? Fortunately, you’re in luck. There are a few different ways to easily get data out of almost any website and into your own spreadsheet. Here is an overview of the methods I use, starting with the simplest.
1. Microsoft Excel Import Data
The “Import Data from Web” functionality in Excel is probably the easiest way to get sports data into a spreadsheet.
Simply click on the “Data” tab in Excel and then select “From Web” as your data connection. (The screenshot is from Excel 2013, but the same functionality goes back to Excel 2007 on Windows. Excel for Mac does not seem to have a direct web connection, but you could download an HTML file and then connect to that static file.) Then input the URL where your data lives on the web, and you should see a small yellow arrow next to each table on the page. Click the arrow next to the table you want, then “Import,” and the data will be dumped into your spreadsheet. Once that connection is established, every time you come back to the sheet you can choose “Refresh All” to update every web data source the worksheet connects to with the latest data.
Sometimes the way a web page is formatted prevents Excel from importing the data with this method. If the page uses a special plug-in or tool to display its data (rather than simple text in an HTML table), this method will not work. Also, if the page you’re pulling from does not always have the same URL (due to query variables, or day-to-day or week-to-week differences), there’s no easy way to change the URL dynamically. If either is a problem for the task you’re trying to accomplish, take a look at the next method.
2. Google Docs – ImportHTML
From a pure functionality standpoint Excel is head and shoulders above Google Docs/Google Spreadsheets. Excel has many formatting advantages and a broader list of functions to choose from, but when it comes to interacting with data on the web, Google Spreadsheets blows Excel away in my opinion. The most important function for this purpose (and one that does not exist in Excel) is =importHTML. An example of an importHTML function would be:
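Something along these lines (the URL here is just a placeholder for whatever stats page you’re pulling from):

```
=IMPORTHTML("http://www.example.com/stats/2015", "table", 1)
```

This would grab the first table Google Sheets identifies on that page and dump it into the spreadsheet, starting at whatever cell holds the formula.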
There are a few things to think about to make sure you get the syntax right:
- Always start the function with an = sign
- The first argument is the URL of your data, and it always gets wrapped in quotation marks. However, if you place the URL in a different cell of the spreadsheet and then reference that cell in the importHTML arguments, the quotation marks are not needed.
- The second argument should be either “table” or “list” (wrapped in quotes). This depends on how the web page is structured. From experience I’d say 95% of pages that display data use tables rather than lists, but it’s something to play around with to see what works for your particular page.
- The final argument is the position of the table (or list) on the page. In Google Sheets this index starts at 1, so if there is only one table on the page its value is very likely 1, and if there are multiple tables and you want the 4th one you would enter 4. If you’re not getting the data you want, try changing this number until you find what you’re really after.
The data returned from this function will automatically refresh every time you open the spreadsheet. This can be really useful when you have a web page that constantly updates stats or game results throughout the season and you want to build a model that is designed to make predictions continually.
This function becomes extremely powerful when the URL of your data source changes (either over time or depending on the variables you choose). You can build a formula using =concatenate that breaks the variables out into different cells and then dynamically assembles the final URL from whatever you put in those input cells. You can then reference the cell holding your assembled URL inside the importHTML function. This is useful on sites like www.basketball-reference.com and any others on the Sports Reference network. It’s also valuable for fantasy football rankings pages that have something like “week1” in the URL that changes from week to week.
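As a sketch of this pattern (the cells and URL here are hypothetical): suppose A1 holds the week number you want. You could assemble the URL in B1 and feed it to importHTML in C1:

```
A1:  1
B1:  =CONCATENATE("http://www.example.com/rankings/week", A1)
C1:  =IMPORTHTML(B1, "table", 1)
```

Note that B1 is referenced without quotation marks in the importHTML call, since it’s a cell reference rather than a literal URL. Change A1 from 1 to 2 and the whole chain re-pulls the week 2 page.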
There are still some limitations that prevent this function from working with certain web pages. Sites that are password protected or otherwise secured will not return any data. Likewise, sites that use an embedded tool or plug-in to display data (instead of a table or list) will not work with this formula. Unfortunately, pulling from a problematic page often causes errors in the spreadsheet that persist even after you delete the formula, so if your spreadsheet contains a lot of formulas or important data it’s best to test your importHTML function on a separate blank sheet first to make sure it doesn’t cause problems with your valuable sheet.
3. Chrome Scraper Extension + ImportXML
Finally, the last tool, and possibly the easiest to use, is the Chrome Web Scraper. This is an extension for Google Chrome that lets you point and click on any web page and export the data into Google Docs. You’ll need to install it as a Chrome extension, and then it’s ready to use. To use the scraper, highlight the first few rows of data on the web page you’re interested in, then right-click and choose “Scrape Similar.” You should see a screen that looks like this:
The example above is from the same ESPN page used in the Excel example, and as you can see there’s an “Export to Google Docs” button in the bottom right. Simply click that button and all the data you see in the Scraper interface will be in your spreadsheet. If you don’t like how the results look, try highlighting the data in a different way (then right-clicking and selecting “Scrape Similar”) to get it into the correct format.
This tool is more useful for a one-time analysis than for continually pulling data. However, the other nice thing it does is give you the XPath reference for the data on the page. XPath is a query language for XML data, and it’s not something I’m terribly comfortable with; there are many resources out there that can offer a much better explanation of XPath than I can. For the purposes of pulling sports data into a spreadsheet, it can be useful when none of the other options work with your page.
You can take the XPath reference that the Scraper shows you and plug that string of text into the =importXML function within Google Docs. importXML takes two arguments: first the URL (enclosed in quotes), and second the XPath reference the Scraper identified for you (also in quotes). Now you’ve programmatically pulled in your data, and it will refresh each time you open the spreadsheet.
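As a sketch, assuming the Scraper reported an XPath of //table[1]//tr for a hypothetical stats page:

```
=IMPORTXML("http://www.example.com/stats", "//table[1]//tr")
```

Whatever XPath the Scraper actually shows for your page goes in that second argument; the formula then returns every node that the expression matches.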
If you have questions on any of these methods to pull sports data into a spreadsheet, feel free to add them in the comments. Every situation is different, and I know how frustrating it can be when you expect something to work and it doesn’t. Keep in mind, though, that certain web pages are simply impossible to pull from, and you may have to resort to plain copying and pasting, or attack the problem from a more technical perspective.