Selenium is one of the most popular and robust tools for automating browser-based work and testing. The Selenium WebDriver offers a variety of methods for selecting elements.

Product Parsing

Consider the following snippet, which creates a Chrome instance, browses to the Amazon front page, and gathers every link it can find:

# filename: profiler_test.py
from selenium import webdriver

CHROME_DRIVER_PATH = 'chromedriver.exe'


def main():  
    d = webdriver.Chrome()

    d.get("https://www.amazon.com")
    d.maximize_window()
    get_all_pages_links(d)
    d.quit()


def get_all_pages_links(d):  
    hrefs = []
    links = d.find_elements_by_tag_name('a')
    for link in links:
        href = link.get_attribute('href')
        hrefs.append(href)
    return hrefs


if __name__ == '__main__':  
    main()

This is the page we are seeing by the way:

Line Profiling

Now the question is, how long did it take us to get the href for every a tag present on this page? If you're curious, there were 427 links on this page alone. That is a lot of searching for Selenium's find_elements_by_tag_name() to do.

One way to determine how fast our code is running is to use a line profiler (here, the line_profiler package, which provides the kernprof command). We add a @profile decorator to both main() and get_all_pages_links(), then run the test from the command line with kernprof.exe -l -v profiler_test.py. The results are:
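One practical wrinkle: kernprof makes the name profile available only while it is running the script, so a decorated file raises a NameError when launched with plain python. A minimal sketch of a common workaround, assuming we want the same file to run both ways, is a no-op fallback decorator. The fallback and the add() function below are illustrative and not part of the original script.

```python
# Fallback so a @profile-decorated script also runs without kernprof.
# kernprof -l injects `profile` into builtins; plain `python` does not.
try:
    profile
except NameError:
    def profile(func):
        """No-op stand-in: return the function unchanged."""
        return func


@profile
def add(a, b):
    # Hypothetical function, used only to show the decorator is harmless.
    return a + b


print(add(2, 3))  # 5
```

With this at the top of profiler_test.py, the script behaves the same under kernprof and simply skips profiling otherwise.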

Wrote profile results to profiler_test.py.lprof  
Timer unit: 4.26667e-07 s

Total time: 15.8145 s  
File: profiler_test.py  
Function: main at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
     8                                           @profile
     9                                           def main():
    10         1      8780838 8780838.0     23.7      d = webdriver.Chrome()
    11
    12         1      2672345 2672345.0      7.2      d.get("https://www.amazon.com")
    13         1      2871840 2871840.0      7.7      d.maximize_window()
    14         1     15225302 15225302.0     41.1      get_all_pages_links(d)
    15         1      7514945 7514945.0     20.3      d.quit()

Total time: 6.49449 s  
File: profiler_test.py  
Function: get_all_pages_links at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
    18                                           @profile
    19                                           def get_all_pages_links(d):
    20         1            3      3.0      0.0      hrefs = []
    21         1      3620538 3620538.0     23.8      links = d.find_elements_by_tag_name('a')
    22       428         1198      2.8      0.0      for link in links:
    23       427     11595822  27156.5     76.2          href = link.get_attribute('href')
    24       427         3875      9.1      0.0          hrefs.append(href)
    25         1            1      1.0      0.0      return hrefs

We see that it took nearly 15.8 seconds for the Chrome instance to start, navigate to the Amazon page, gather all the links, and quit. What was the slowest step in gathering the links? It was line 23, where we ask for each a tag's href. This gathering of href values accounts for 76% of the runtime within get_all_pages_links() (about 4.95 seconds, or nearly one-third of the total script runtime). That is far too long for automated web testing. Imagine needing to gather all the links from hundreds or thousands of web pages: the process would take hours, even for simple websites.
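The Time column in the profiler output is measured in timer units, not seconds, so the figures above come from a small conversion. As a worked example, here is that arithmetic using numbers copied from the first profile. The variable and function names are our own.

```python
# Convert line_profiler "Time" ticks to seconds using the reported timer unit.
TIMER_UNIT = 4.26667e-07  # seconds per tick, from the "Timer unit" line


def ticks_to_seconds(ticks, unit=TIMER_UNIT):
    """Return the wall time in seconds for a Time-column value."""
    return ticks * unit


get_attribute_ticks = 11595822  # Time column for the get_attribute() line
get_attribute_hits = 427        # Hits column for the same line
main_total_seconds = 15.8145    # "Total time" reported for main()

loop_seconds = ticks_to_seconds(get_attribute_ticks)
per_call_ms = loop_seconds / get_attribute_hits * 1000
share_of_script = loop_seconds / main_total_seconds

print(f"{loop_seconds:.2f} s spent in get_attribute()")
print(f"{per_call_ms:.1f} ms per call")
print(f"{share_of_script:.0%} of the whole script")
```

Each get_attribute() call costs roughly 11.6 ms, which looks small until it is multiplied by every link on the page.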

Utilizing XPath

How can we do better? Does the webdriver.Chrome class offer another method that finds tags and their attributes faster? What if we tried to find all the a tags and their hrefs using XPath? Let's try it. Our code now looks like:

# filename: profiler_test.py

from selenium import webdriver

CHROME_DRIVER_PATH = 'chromedriver.exe'


@profile
def main():  
    d = webdriver.Chrome()

    d.get("https://www.amazon.com")
    d.maximize_window()
    get_all_pages_links(d)
    d.quit()


@profile
def get_all_pages_links(d):  
    hrefs = []
    links = d.find_elements_by_xpath('//a')
    for link in links:
        href = link.get_attribute('href')
        hrefs.append(href)
    return hrefs


if __name__ == '__main__':  
    main()

And the results from running with the profiler:

Wrote profile results to profiler_test.py.lprof  
Timer unit: 4.26667e-07 s

Total time: 16.262 s  
File: profiler_test.py  
Function: main at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
     8                                           @profile
     9                                           def main():
    10         1      8203882 8203882.0     21.5      d = webdriver.Chrome()
    11
    12         1      8535178 8535178.0     22.4      d.get("https://www.amazon.com")
    13         1      2910691 2910691.0      7.6      d.maximize_window()
    14         1     10979849 10979849.0     28.8      get_all_pages_links(d)
    15         1      7484319 7484319.0     19.6      d.quit()

Total time: 4.68298 s  
File: profiler_test.py  
Function: get_all_pages_links at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
    18                                           @profile
    19                                           def get_all_pages_links(d):
    20         1            3      3.0      0.0      hrefs = []
    21         1       317767 317767.0      2.9      links = d.find_elements_by_xpath('//a')
    22       431         1092      2.5      0.0      for link in links:
    23       430     10652981  24774.4     97.1          href = link.get_attribute('href')
    24       430         3875      9.0      0.0          hrefs.append(href)
    25         1            2      2.0      0.0      return hrefs

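The same gathering operation, the get_attribute() loop on line 23, now takes around 4.55 seconds, an improvement of about 0.4 seconds. The element lookup on line 21 improved far more, dropping from about 1.54 seconds to about 0.14 seconds, which brings get_all_pages_links() as a whole from 6.49 seconds down to 4.68 seconds. (The page returned 430 links on this run rather than 427, so the two runs are not perfectly identical.) Although good, this does not help much in the bigger picture, because the loop still accounts for 97% of the function's runtime. We need these links, and we need them fast. So here's one last idea.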

Scripts Within Scripts

Selenium WebDrivers have a function called execute_script(), which, as the name implies, injects JavaScript code into the browser and returns whatever that code returns. Let's check whether directly injecting such code gets all the page's links faster than calling get_attribute() on each element. The raw JavaScript to retrieve all of the page's href values looks like this:

function get_all_hrefs() {  
    var anchors = document.links;
    var hrefs = [];
    for (var i = 0; i < anchors.length; i++) {
        hrefs.push(anchors[i].href);
    }

    return hrefs;
}

We rewrite profiler_test.py to use this function:

# filename: profiler_test.py

from selenium import webdriver

CHROME_DRIVER_PATH = 'chromedriver.exe'

JS_GET_ALL_HREFS = r'''function get_all_hrefs() {  
    var anchors = document.links;
    var hrefs = [];
    for (var i = 0; i < anchors.length; i++) {
        hrefs.push(anchors[i].href);
    }

    return hrefs;
}
return get_all_hrefs();  
'''

@profile
def main():  
    d = webdriver.Chrome()

    d.get("https://www.amazon.com")
    d.maximize_window()
    H = get_all_pages_links(d)
    for h in H:
        print(h)
    d.quit()


@profile
def get_all_pages_links(d):  
    hrefs = d.execute_script(JS_GET_ALL_HREFS)
    return hrefs


if __name__ == '__main__':  
    main()

And lastly, we test with the line profiler:

Wrote profile results to profiler_test.py.lprof  
Timer unit: 4.26667e-07 s

Total time: 10.4917 s  
File: profiler_test.py  
Function: main at line 19

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
    19                                           @profile
    20                                           def main():
    21         1      8136920 8136920.0     33.1      d = webdriver.Chrome()
    22
    23         1      5377998 5377998.0     21.9      d.get("https://www.amazon.com")
    24         1      2888695 2888695.0     11.7      d.maximize_window()
    25         1       203765 203765.0      0.8      H = get_all_pages_links(d)
    26       451         4658     10.3      0.0      for h in H:
    27       450       489150   1087.0      2.0          print(h)
    28         1      7488800 7488800.0     30.5      d.quit()

Total time: 0.0868468 s  
File: profiler_test.py  
Function: get_all_pages_links at line 31

Line #      Hits         Time  Per Hit   % Time  Line Contents  
==============================================================
    31                                           @profile
    32                                           def get_all_pages_links(d):
    33         1       203523 203523.0    100.0      hrefs = d.execute_script(JS_GET_ALL_HREFS)
    34         1           24     24.0      0.0      return hrefs

The results speak for themselves. The runtime of get_all_pages_links() went from 6.49 seconds to 4.68 seconds to 0.087 seconds. Compared with the first version, we cut the time needed to scrape all the page's links by a factor of roughly 75.

These results demonstrate the power of line profiling and how it can be used to drastically improve runtimes. Testing each approach on a case-by-case basis tells developers exactly which one runs most efficiently, so that they can create faster and more robust automated tests. In this case, the links could be scraped in an alternative way that goes through one less layer of abstraction. Instead of using the Selenium library within Python to access each browser element, we use JavaScript to access those elements much more quickly, and then simply send the results back to our library for processing. By cutting out the middleman, we save time and computational power.
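One way to picture the "middleman" is to count how many commands each approach sends to the browser. The sketch below is not real Selenium: CountingDriver and CountingElement are hypothetical stand-ins that only count calls, under the assumption that every WebDriver method call corresponds to one command sent to the browser driver.

```python
# Hypothetical stubs that count commands instead of talking to a browser.
class CountingElement:
    def __init__(self, driver, href):
        self._driver = driver
        self._href = href

    def get_attribute(self, name):
        # The stub ignores `name` and always returns the stored href.
        self._driver.commands += 1  # one command per element
        return self._href


class CountingDriver:
    def __init__(self, hrefs):
        self._hrefs = list(hrefs)
        self.commands = 0

    def find_elements(self, by, value):
        self.commands += 1  # one command returns every element handle
        return [CountingElement(self, h) for h in self._hrefs]

    def execute_script(self, script):
        # The stub does not run the script; it returns the stored hrefs.
        self.commands += 1  # one command returns the finished list
        return list(self._hrefs)


def links_per_element(driver):
    """Locate the elements, then ask for each href individually."""
    elements = driver.find_elements("tag name", "a")
    return [el.get_attribute("href") for el in elements]


def links_single_script(driver):
    """Collect every href inside the browser and return them at once."""
    return driver.execute_script(
        "return Array.from(document.links, a => a.href);"
    )


hrefs = [f"https://example.com/{i}" for i in range(427)]

per_element = CountingDriver(hrefs)
links_per_element(per_element)

single_script = CountingDriver(hrefs)
links_single_script(single_script)

print(per_element.commands, single_script.commands)  # 428 1
```

For 427 links, the element-by-element version issues 428 commands while the script version issues one, which is consistent with the get_attribute() loop dominating the first two profiles.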