Gecko is a customer relationship management (CRM) system built on the Salesforce platform. It is used to manage customer data and interactions and to automate business processes such as employee document handling and workflows.
Manually downloading documents from Gecko (Salesforce) to assemble a person's full employment history is time-consuming and error-prone. This project addresses that problem by automating the document download process with Python and the Selenium library.
Project Goals
The primary goal of this project is to automate the following tasks:
- Accessing and Navigating Gecko (Salesforce)
- Identifying and Clicking on the “Documents” Tab
- Expanding the “View All” Section
- Iterating through and Extracting Document Information
- Downloading Documents to a Specific Location
Technologies Used
To achieve these goals, the following technologies were employed:
- Python: A general-purpose programming language with a large ecosystem of libraries for automation tasks.
- Selenium: A widely used web automation framework that enables interaction with web elements and simulates user actions.
import config
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
def print_info(row_key, document_name, link2doc, date_of_document, doc_link):
    """Prints the value of row_key followed by the strings
    "Document Name:", document_name, link2doc, and date_of_document
    concatenated together. Then prints the string "Link:"
    followed by the value of doc_link."""
    # Print the extracted information
    info = f"{row_key} Document Name: {document_name} - {link2doc} - {date_of_document}"
    link = f"Link: {doc_link}"
    print(info)
    print(link)
    print()
# download_document performs the following steps:
# 1. Opens link_element in a new tab by simulating Ctrl + click
#    with the ActionChains class from the selenium library.
# 2. Switches to the newly opened tab.
# 3. Waits 20 seconds for the page to load.
# 4. Finds and clicks the download button on the page.
# 5. Waits 20 seconds for the download to complete.
# 6. Closes the new tab and switches back to the original tab.
# If an exception occurs, it prints an error message and still closes
# the new tab (if one was opened) and switches back to the original tab.
def download_document(driver, link_element):
    # Remember the original tab so it can always be restored
    original_window = driver.current_window_handle
    try:
        # Open the link_element in a new tab (on macOS, use Keys.COMMAND instead of Keys.CONTROL)
        action_chains = ActionChains(driver)
        action_chains.key_down(Keys.CONTROL).click(link_element).key_up(Keys.CONTROL).perform()
        ### another option
        # driver.execute_script("window.open(arguments[0], '_blank')", link_element.get_attribute("href"))
        # Switch to the new tab
        driver.switch_to.window(driver.window_handles[-1])
        # Wait for the page to load
        time.sleep(20)  # Adjust the sleep duration as needed
        # Initiate the download
        download_button = driver.find_element(By.XPATH, '//a[@title="Download"]')
        download_button.click()
        ### another option
        # driver.find_element(By.CSS_SELECTOR, 'a[title="Download"]').click()
        # Wait for the download to complete
        time.sleep(20)  # Adjust the sleep duration as needed
    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Close the current tab only if it is not the original one, so a
        # failed Ctrl + click cannot close the document list itself
        if driver.current_window_handle != original_window:
            driver.close()
        # Switch back to the original tab
        driver.switch_to.window(original_window)
def main() -> None:
    # Get the address, login page title, and Documents tab label from the config.ini file
    address, login_title, documents_tab = config.read_config()
    # Set up the Selenium webdriver
    with config.load_chrome_driver() as driver:
        # Navigate to the webpage
        driver.get(address)
        # Wait for the page to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.TAG_NAME, "title")))
        # Check once per second whether the user is logged in: the page title
        # no longer contains the login page title once the user has signed in
        while login_title in driver.title:
            time.sleep(1)
        # Wait for the Documents tab to be visible, finding it by its label
        documents_link = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, f'//a[contains(text(), "{documents_tab}")]')))
        # Click on the Documents tab
        documents_link.click()
        # Wait for the "View All" link to appear and then click on it
        try:
            view_all_link = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.LINK_TEXT, "View All")))
            # Execute JavaScript to click on the "View All" link
            driver.execute_script("arguments[0].click();", view_all_link)
        except TimeoutException:
            print("The 'View All' link did not appear within 1 minute.")
        # Wait until the "Hide" link is visible
        WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Hide']")))
        # Wait for the table rows that represent each item to be present
        item_rows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//tr[@data-row-key-value]")))
        # print(len(item_rows))
        # print(item_rows[1].get_attribute('innerHTML'))
        # document_name = item_rows[1].find_element(By.XPATH, './/th[@data-label="Document Type"]')
        # print(document_name)
        # Loop through each item row after the first and extract the desired information
        for item_row in item_rows[1:]:
            # Find the <a> element that links to the document
            link_element = item_row.find_element(By.XPATH, './/td[@data-label="Link"]//lightning-formatted-url/a')
            # Extract the Document Name
            document_name = item_row.find_element(By.XPATH, './/th[@data-label="Document Type"]//lightning-base-formatted-text').text.strip()
            # Extract the Link2Doc (the visible text of the link)
            link2doc = link_element.text.strip()
            # Extract the DateOfDocument
            date_of_document = item_row.find_element(By.XPATH, './/td[@data-label="Date Uploaded"]//lightning-formatted-date-time').text.strip()
            # Print the extracted information
            print_info(item_row.get_attribute('data-row-key-value'), document_name, link2doc, date_of_document, link_element.get_attribute('href'))
            # Call download function
            download_document(driver, link_element)
    # Leaving the with block quits the webdriver and closes the browser


if __name__ == "__main__":
    main()
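The `config` module imported at the top of the script is not shown in this post. As a rough idea of what such a module could look like, here is a minimal sketch. It assumes a `config.ini` file with a `[gecko]` section holding `address`, `login_title`, and `documents_tab`; the section name, the key names, the `chrome_download_prefs` helper, and the `download_dir` parameter are all illustrative assumptions, and the real module in the repository may differ.

```python
# config.py: hypothetical sketch, not the project's actual module.
import configparser
from pathlib import Path


def read_config(path="config.ini"):
    """Return (address, login_title, documents_tab) from an INI file.

    The [gecko] section and its key names are assumptions for illustration.
    """
    # interpolation=None keeps "%" characters in URLs from being misread
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(f"Config file not found: {path}")
    section = parser["gecko"]
    return section["address"], section["login_title"], section["documents_tab"]


def chrome_download_prefs(download_dir):
    """Build Chrome preferences that save files to download_dir without a prompt."""
    return {
        "download.default_directory": str(Path(download_dir).resolve()),
        "download.prompt_for_download": False,
    }


def load_chrome_driver(download_dir="downloads"):
    """Start Chrome with the download preferences above."""
    # Imported lazily so read_config() can be used without Selenium installed
    from selenium import webdriver

    Path(download_dir).mkdir(parents=True, exist_ok=True)
    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", chrome_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```

Because Selenium's `WebDriver` can be used as a context manager, a driver returned this way works with the `with config.load_chrome_driver() as driver:` statement in `main()`.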
Project Implementation
The project’s implementation involves the following steps:
- Establishing a Connection: The script opens a browser window and navigates to the Gecko address read from the configuration file.
- Validating User Login: The script waits for the user to log in, checking the page title once per second until it no longer contains the login page title.
- Accessing the Documents Tab: The script identifies the “Documents” tab and clicks on it to display the document list.
- Expanding “View All” Section: The script waits up to one minute for the “View All” link and clicks it to reveal all available documents.
- Extracting Document Information: The script iterates through each document row and extracts the document name, the link to the document, and the date the document was uploaded.
- Downloading Documents: The script opens each extracted link in a new tab and clicks its Download button, so the browser saves the corresponding document to the specified location.
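The fixed `time.sleep(20)` after clicking Download can be too short for large files and wasteful for small ones. One possible alternative, sketched below, is to poll the download folder until a new finished file appears. It assumes that Chrome marks in-progress downloads with a `.crdownload` suffix and that the download folder is known; the function name `wait_for_download` and its parameters are hypothetical and not part of the original script.

```python
import time
from pathlib import Path


def wait_for_download(download_dir, known_files, timeout=60.0, poll_interval=0.5):
    """Wait until a new, fully written file appears in download_dir.

    known_files holds the file names present before the download started.
    Returns the Path of the new file, or raises TimeoutError.
    """
    directory = Path(download_dir)
    known = set(known_files)
    deadline = time.monotonic() + timeout
    while True:
        current = {p.name for p in directory.iterdir() if p.is_file()}
        new_files = current - known
        # Assumption: Chrome keeps a ".crdownload" file while downloading
        in_progress = any(name.endswith(".crdownload") for name in new_files)
        finished = sorted(n for n in new_files if not n.endswith(".crdownload"))
        if finished and not in_progress:
            return directory / finished[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No completed download in {directory} after {timeout} s")
        time.sleep(poll_interval)
```

In `download_document`, one could record the folder contents with `known = {p.name for p in Path(download_dir).iterdir()}` before clicking Download, then call `wait_for_download(download_dir, known)` in place of the second sleep.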
Conclusion
This project showcases the benefits of automation in simplifying document management processes. Using Python and Selenium, it automates accessing, browsing, and downloading documents from Gecko (Salesforce). The result is a complete collection of a person's employment history documents that are available on Gecko.
This project can be applied to various scenarios requiring automated document download, such as legal, financial, and medical workflows.
The code is publicly available on GitHub: https://github.com/rezabs/gecko-doc-downloader/