Random surfer

We look at a part of the Internet that consists of 10 web pages, numbered from 1 to 10. This network is represented by a directed graph (see above), whose arrows symbolize the links between the web pages. We model a random surfer who moves through this part of the Internet. The surfer obeys the following rules:

  • The surfer starts on a random website.
  • The surfer clicks on a random link to get to the next webpage (all links are equally likely).

Which web page is visited most by the random surfer if he or she stays on the network for a long time?

Options: 1, 4, 5, 7, 10


4 solutions

Markus Michelmann
May 10, 2018

We consider the probability $p_i^{(k)}$ that the surfer is on web page $i$ in the $k$-th step. Since the first website is chosen completely at random, we start with a uniform distribution: $p_i^{(0)} = \frac{1}{n} = 10\,\%$, where $n = 10$ is the number of web pages. If the probabilities $p_i^{(k)}$ for the $k$-th step are known, then the probabilities for the next step can be calculated by
$$p_i^{(k+1)} = \sum_{j \to i} \frac{1}{l_j}\, p_j^{(k)}$$
The sum runs over all web pages $j$ that have a link to web page $i$, and $l_j$ is the number of links starting from page $j$. In order to jump from $j$ to $i$, the surfer must first have been on page $j$ (with probability $p_j^{(k)}$) and then have selected the link to page $i$ (with probability $1/l_j$). (This formula is Google's PageRank algorithm with damping factor $d = 1$.)
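The update rule above can also be read as a matrix-vector product $p^{(k+1)} = M\,p^{(k)}$ with a column-stochastic matrix $M$. The following sketch is my addition, not part of the original solution; it builds $M$ from the same link list used in the solutions below and performs a single step:

```python
# Sketch: the update p^(k+1) = M p^(k) as a column-stochastic matrix.
# The link list matches the one used in the solutions below (pages 1..10).
web = [[2,4], [3,4], [1,4], [1,5], [7], [7], [4,5,6,10], [9,10], [10], [5,8]]
n = len(web)

# M[i][j] = 1/l_j if page j+1 links to page i+1, else 0
M = [[0.0] * n for _ in range(n)]
for j, links in enumerate(web):
    for i in links:
        M[i - 1][j] = 1.0 / len(links)

p = [1.0 / n] * n                      # uniform start, p_i^(0) = 10 %
p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]  # one step
print([round(x, 4) for x in p])
# -> [0.1, 0.05, 0.05, 0.175, 0.125, 0.025, 0.2, 0.05, 0.05, 0.175]
```

Because every column of $M$ sums to 1, each step preserves the total probability.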

The Python program below computes the probability distribution over the network for the first 25 steps. After about 15 steps the values begin to converge. In the last step we get
$$\text{prob} \approx [0.0738, 0.0369, 0.0185, 0.1291, 0.1935, 0.0645, 0.258, 0.0645, 0.0322, 0.129]$$
Therefore, the highest probability, $p \approx 26\,\%$, belongs to website 7.

steps = 25
web_size = 10
web = [[2,4], [3,4], [1,4], [1,5], [7], [7], [4,5,6,10], [9,10], [10], [5,8]]
   # 'web' describes the links in the network, 'web[i]' is a list of links outgoing from page i+1
prob = [1./web_size for i in range(web_size)]
  # initialize uniform distribution: 'prob[i]' is the probability for page i+1
print([round(prob[i],4) for i in range(web_size)])

# iteration over 25 steps:
for n in range(steps):
    new_prob = [0 for i in range(web_size)]      # initialize new probability distribution
    for i in range(web_size):                    # iterate over all pages
        for j in web[i]:                         # iterate over all links of page i
            new_prob[j-1] += prob[i]/len(web[i]) # add the probability flowing from i to j
    prob = new_prob[:]                           #copy probability distribution
    print([round(prob[i],4) for i in range(web_size)])
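The loop above runs a fixed 25 steps. A small variation (my addition, not part of the original solution) is to iterate until the distribution stops changing; the limit can be checked against the exact stationary value $p_7 = 8/31 \approx 0.258$, which follows from solving the balance equations of the network by hand:

```python
# Run the same update rule until the distribution stops changing.
web = [[2,4], [3,4], [1,4], [1,5], [7], [7], [4,5,6,10], [9,10], [10], [5,8]]
n = len(web)
prob = [1.0 / n] * n

while True:
    new_prob = [0.0] * n
    for i in range(n):
        for j in web[i]:
            new_prob[j - 1] += prob[i] / len(web[i])
    done = max(abs(a - b) for a, b in zip(prob, new_prob)) < 1e-12
    prob = new_prob
    if done:
        break

print(prob.index(max(prob)) + 1)   # most visited page -> 7
print(round(prob[6], 4))           # probability for page 7 -> 0.2581
```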

Antoine Crbs
Nov 14, 2018

Simple solution using Python 3

import random
# links in 0-indexed form: m[i] lists the pages reachable from page i+1
m,ar=[[1,3],[2,3],[0,3],[0,4],[6],[6],[3,4,5,9],[8,9],[9],[4,7]],[0 for d in range(10)]
for a in range(10):          # test all possible start pages
    s=a
    for b in range(10**4):   # simulate 10**4 clicks
        s=random.choice(m[s])
        ar[s]+=1             # count the visits of each page
print(ar.index(max(ar))+1)   # page with the most visits

output: 7
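The simulation above restarts from every possible page. An equivalent seeded variant (my addition, for reproducibility) runs one long walk and reports the winning page together with its visit frequency, which should land near the $26\,\%$ computed analytically:

```python
import random

# Same Monte Carlo idea, with a fixed seed for reproducibility and a
# relative frequency instead of raw counts.
random.seed(0)
m = [[1,3],[2,3],[0,3],[0,4],[6],[6],[3,4,5,9],[8,9],[9],[4,7]]
counts = [0] * 10
s = random.randrange(10)      # random start page
steps = 10**5
for _ in range(steps):
    s = random.choice(m[s])
    counts[s] += 1

best = counts.index(max(counts)) + 1
print(best, round(counts[best - 1] / steps, 3))  # winning page and its frequency
```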
Vlad Vasilescu
Jul 8, 2018

As the range for i is increased, i.e. the surfer stays on the network for a long time, the number of visits to the $\boxed{7^{th}}$ site exceeds all the others. Examples of output:

  • for i=1000: [60, 36, 22, 121, 169, 67, $\boxed{274}$, 74, 32, 145]
  • for i=10000: [436, 191, 95, 1163, 1856, 725, $\boxed{2976}$, 727, 366, 1465]
  • for i=100000: [4267, 2129, 1104, 11033, 18426, 7433, $\boxed{29473}$, 7500, 3741, 14894]

Since I expected a lot of calculations (as seen in @Markus Michelmann's solution), I tried an intuitive argument instead. If you look at the graph, you can see that website 7 will be reached eventually, no matter where the surfer starts. Two of the four sites linked from 7 send the surfer straight back: websites 5 and 6 each have only a single outgoing link, and it points to 7. Websites 4 and 10 lead back to 7 through a short loop. In the long run, website 7 becomes a kind of crossroads, so it is visited the most.

In principle, you are right that you can examine the network intuitively. Two points are important:

  • Website 5 is especially popular because all the popular websites 4, 7 and 10 link to it.
  • Website 5 has only one outgoing link, which points to 7.

It follows immediately that website 7 gets at least as many visits as website 5. Since we also have the loop $7 \leftrightarrow 6$, the surfer can visit 7 several times in a row without passing through website 5. Therefore, website 7 must be more popular than 5.

This argument no longer works if you add even a single link. If you also had the link $5 \to 10$, website 7 would no longer be the most popular site.
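To check this claim (my addition, a sketch reusing the same update rule as in the first solution), we can re-run the power iteration with the extra link $5 \to 10$ added. The top spot indeed moves away from website 7; interestingly, the iteration puts website 10, not 5, on top:

```python
# Re-run the iteration with one extra link 5 -> 10 added to the network.
web = [[2,4], [3,4], [1,4], [1,5], [7,10], [7], [4,5,6,10], [9,10], [10], [5,8]]
n = len(web)
prob = [1.0 / n] * n
for _ in range(200):                 # plenty of steps to converge
    new_prob = [0.0] * n
    for i in range(n):
        for j in web[i]:
            new_prob[j - 1] += prob[i] / len(web[i])
    prob = new_prob

# with the extra link, websites 5 and 10 both overtake website 7
print([round(p, 4) for p in prob])
print(prob.index(max(prob)) + 1)     # the new most visited page -> 10
```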

Markus Michelmann - 3 years ago
