# The Web

When we talked about reading and writing files, we mentioned that a file is a convenient abstraction that our computer relies on.
Indeed, reading data from remote machines is very similar to reading files from the local filesystem.


From the computer's perspective, a network connection behaves much like another file that can be read and written.
Based on the **protocol** we're using, various layers of abstraction will ensure that we connect to the right machine and read the right content.
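
To make the "network as a file" idea concrete, here is a minimal sketch using only the standard library. It assumes nothing about a real remote machine: `socket.socketpair()` creates two connected sockets inside the same process, and the names `local_end` and `remote_end` are just illustrative.

```python
import socket

# socketpair() returns two already-connected sockets in this process.
# They stand in for "our machine" and "the remote machine".
local_end, remote_end = socket.socketpair()

# makefile() wraps a socket in a file-like object, so the familiar
# write() and readline() calls work on it.
with remote_end.makefile("w") as writer, local_end.makefile("r") as reader:
    writer.write("hello from the other side\n")
    writer.flush()            # push the buffered text onto the socket
    line = reader.readline()  # read it back like a line from a file

local_end.close()
remote_end.close()
print(line.strip())
```

A real connection adds addressing and a protocol on top, but the read and write calls look much the same.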


One such protocol is HTTP, the Hypertext Transfer Protocol.
This protocol was designed to allow us to read the resource at a given path from a remote machine.
The combination `http://<machine>/<path>` is called a URL, or uniform resource locator.


This setup was primarily designed to serve a visual design language, or *markup language*, called HTML - hypertext markup language.
HTML is a hierarchical language that certain applications, like web browsers, know how to read and display graphically.
The modern web includes much more than this, such as a styling language for HTML (CSS) and an embedded runtime that can safely execute JavaScript code provided by a remote server.

Even though it is not a web browser, Python can read and write "files" via HTTP.
To do so, we can use a library called `requests` (many other libraries can do the same).
Let's start by reading the course website:

In [None]:
import requests

In [None]:
resp = requests.get("https://cs.columbia.edu/~paine/1006/")
resp.content

There's a lot happening in a small amount of code:

- Use protocol `https` (secure `http`)
- Connect to the server called `cs.columbia.edu`
- Read the path `/~paine/1006/`
- Inspect the content
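
The breakdown above can be checked without touching the network. This is a small sketch using the standard library's `urllib.parse.urlsplit`, which only splits the URL string into its parts and does not make a request.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://cs.columbia.edu/~paine/1006/")

print(parts.scheme)  # the protocol
print(parts.netloc)  # the server to connect to
print(parts.path)    # the path to read on that server
```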

An HTTP response can be successful, or it can report that something else happened: a redirect, a missing path, or a server error.
Notice that the response content is an HTML document - this is what our browser would display as the course homepage.
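
Each response carries a numeric status code that says which of these outcomes occurred. The sketch below groups codes into their broad ranges using the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and is not part of `requests`. With `requests`, the code is available as `resp.status_code`, and `resp.raise_for_status()` raises an exception for error responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a rough, human-readable category for an HTTP status code."""
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"  # e.g. a missing path
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "unknown"

    # HTTPStatus knows the standard phrase for registered codes only.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unregistered"
    return f"{code} {phrase}: {category}"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```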

Some HTTP endpoints serve documents like this designed for display, while others might serve files like the `csv` files we've read previously in the class.


# Reading content out of HTML

If the server provides us an HTML document, we can pull out tables with `pandas`.
The course homepage has a few tables for homeworks, the schedule, etc., and we can use `pandas.read_html` to find them.
This function returns a list with one `pandas.DataFrame` per table it finds.
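
To get a feel for what "finding the tables" involves, here is a rough standard-library sketch. It is not how `pandas` implements `read_html`; the `TableExtractor` class and the HTML snippet are made up for illustration, and the sketch ignores nested tables and malformed markup.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> found
        self._row = None   # row currently being filled, if any
        self._cell = None  # text pieces of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical page content with two tables.
html = """
<h1>Hypothetical page</h1>
<table>
  <tr><th>Homework</th><th>Due</th></tr>
  <tr><td>HW1</td><td>Week 3</td></tr>
</table>
<table>
  <tr><th>Week</th><th>Topic</th></tr>
  <tr><td>1</td><td>Intro</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)
print(len(parser.tables))
print(parser.tables[0])
```

`pandas.read_html` does this kind of work for us and additionally converts each table into a `DataFrame`, which is why indexing its result with `[0]` or `[1]` selects a table.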

In [None]:
import pandas as pd
pd.read_html("https://cs.columbia.edu/~paine/1006/")[0]

In [None]:
pd.read_html("https://cs.columbia.edu/~paine/1006/")[1]

Note that we could have also provided the text content from our previous request:

In [None]:
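from io import StringIO
res = requests.get("https://cs.columbia.edu/~paine/1006")
# Recent pandas versions want literal HTML wrapped in a file-like object
pd.read_html(StringIO(res.text))[0]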

Some web servers provide data as a different kind of file, like a CSV or JSON file.
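Let's query one such web server, run by the Federal Reserve Bank of St. Louis (FRED).
This endpoint provides CSV data for the 5-year breakeven inflation rate (series `T5YIE`), a market-based measure of expected inflation.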

In [None]:

url = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23ebf3fb&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1320&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=T5YIE&scale=left&cosd=2015-04-29&coed=2025-04-29&line_color=%230073e6&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Daily&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2025-04-29&revision_date=2025-04-29&nd=2003-01-02"
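
Most of that URL is a query string: `key=value` pairs after the `?`, separated by `&`. The sketch below parses a shortened copy of it with the standard library. `short_url` keeps only a few of the original parameters for readability; whether the server accepts this trimmed form is an assumption, so the code only parses the string and never fetches it.

```python
from urllib.parse import parse_qs, urlsplit

# A shortened copy of the URL above, kept to a few parameters (illustrative only).
short_url = (
    "https://fred.stlouisfed.org/graph/fredgraph.csv"
    "?id=T5YIE&cosd=2015-04-29&coed=2025-04-29&fq=Daily"
)

parts = urlsplit(short_url)
params = parse_qs(parts.query)  # maps each key to a list of values

print(parts.path)
for key, values in params.items():
    print(key, values)
```

Here `id` names the data series, and `cosd` and `coed` look like the start and end dates of the requested range.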

In [None]:
df = pd.read_csv(url)

In [None]:
df.set_index("observation_date", inplace=True)
df.plot()


When data is provided in this form, it is often extremely easy to work with.

Now let's look at an example of a data source that is harder to work with.
In this case, we'll look at some ESPN websites with multiple tables.
First, we look to see if we can query it at all:

In [None]:
requests.get('https://www.espn.com/nba/team/stats/_/name/cle').text[:100]

Now let's try to stitch together the different tables on the page:

In [None]:
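# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html("https://www.espn.com/nba/team/stats/_/name/cle")
tables[0]

In [None]:
tables[1]

In [None]:
# Place the first two tables side by side (axis=1 joins columns, aligned on the row index)
combined_df = pd.concat((tables[0], tables[1]), axis=1)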
combined_df
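
To see what `pd.concat(..., axis=1)` does without depending on a live web page, here is a small sketch with made-up data. The two frames `names` and `stats` and all of their values are hypothetical stand-ins for a pair of tables that describe the same rows.

```python
import pandas as pd

# Hypothetical stand-ins for two tables that describe the same rows.
names = pd.DataFrame({"Name": ["Player A", "Player B", "Total"]})
stats = pd.DataFrame({"GP": [70, 65, 82], "PTS": [20.1, 15.3, 110.2]})

# axis=1 places the frames next to each other, matching rows by index label.
# Both frames use the default 0, 1, 2 index, so rows line up by position.
side_by_side = pd.concat((names, stats), axis=1)
print(side_by_side)
```

If the two tables had different indexes or row counts, the unmatched cells would be filled with `NaN`, so it is worth checking that the tables really do line up before stitching them together.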

# Writing our own webserver

Now that we've started playing around as a client, let's look at how we'd write our own server.
We'll use an old-but-stable library called Flask to serve our `combined_df` at the path `/`.


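Once we run the web server, we can connect to the printed link (by default `http://127.0.0.1:5000`) to see our data.
The cell that calls `app.run()` keeps running until we interrupt it.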

In [None]:
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    return combined_df.to_html()

In [None]:
app.run()
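
Because `app.run()` blocks until it is interrupted, it can be handy to exercise a route without starting the server at all. The sketch below uses Flask's built-in test client. The app name `demo_app` and the returned HTML are placeholders, chosen so that this example does not overwrite the `app` defined above.

```python
from flask import Flask

demo_app = Flask(__name__)


@demo_app.route("/")
def index():
    # Placeholder HTML; the real app returns combined_df.to_html()
    return "<table><tr><td>hello</td></tr></table>"


# The test client sends a request straight to the app, with no network involved.
client = demo_app.test_client()
response = client.get("/")

print(response.status_code)
print(response.get_data(as_text=True))
```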

We can also serve other types of files.
Let's write some code to generate an image with `matplotlib` and serve it at the path `/picture`.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from flask import Flask, send_file
app = Flask(__name__)

@app.route('/')
def hello_world():
    return combined_df.to_html()

@app.route('/picture')
def hello_world2():
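    # Draw a random scatter plot, save it to disk, then send the file back
    plt.figure(figsize=(10, 6))
    plt.scatter(np.random.rand(50), np.random.rand(50), c=np.random.rand(50))
    plt.xlabel("X-axis")
    plt.ylabel("Y-axis")
    plt.title("Random Scatter Plot")
    plt.savefig('test.png')
    plt.close()  # release the figure so repeated requests don't pile up open figures
    return send_file('test.png', mimetype='image/png')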

In [None]:
app.run()
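
One possible variation, sketched under a few assumptions, is to skip the temporary file and keep the image in memory with `io.BytesIO`. The names `image_app` and `picture` are illustrative. Selecting the non-interactive `Agg` backend is a choice made here for server-style use, and you may prefer to leave that line out inside a notebook that also shows inline plots.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: no display is needed when serving images

import matplotlib.pyplot as plt
import numpy as np
from flask import Flask, send_file

image_app = Flask(__name__)


@image_app.route("/picture")
def picture():
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(np.random.rand(50), np.random.rand(50), c=np.random.rand(50))
    ax.set_title("Random Scatter Plot")

    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")  # write the PNG bytes into memory
    plt.close(fig)                     # release the figure
    buffer.seek(0)                     # rewind so send_file reads from the start
    return send_file(buffer, mimetype="image/png")


# Exercise the route with the test client instead of starting the server.
png_bytes = image_app.test_client().get("/picture").data
print(len(png_bytes) > 0)
```

Every PNG file begins with the same eight-byte signature, which gives a quick way to confirm that the route returned an image.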