stpy package
Subpackages
Submodules
stpy.utils module
This contains some helper functions.
- class stpy.utils.FingerprintTransformer(fp='morgan', radius=3, fpSize=2048, dtype='uint8')
Bases: BaseEstimator, TransformerMixin
- fit(X, y=None)
- transform(X)
- class stpy.utils.MoleculeStandardizer(steps=None, largest_fragment=False, numThreads=4, logger=None)
Bases: object
A reusable, configurable RDKit molecule standardization pipeline.
- Parameters:
steps – Supported steps:
“normalize”
“remove_fragments”
“reionize”
“tautomer_parent”
“stereo_parent”
“isotope_parent”
“charge_parent”
“super_parent”
“cleanup”
largest_fragment (bool, optional) – Keep only the largest fragment after processing.
numThreads (int, optional) – Number of threads for RDKit operations.
logger (logging.Logger, optional) – Logger for reporting actions.
Notes
This class is designed for batch processing and reproducible pipelines.
- standardize_mols(mols)
Apply the configured standardization pipeline to RDKit Mol objects.
- standardize_smiles(smiles_list)
Standardize a list of SMILES strings.
- Returns:
Standardized SMILES strings.
- Return type:
list of str
- stpy.utils.cas2smiles(cas_number='50-78-2')
Convert a CAS number to SMILES using the PubChem PUG-REST API.
- Parameters:
cas_number (str) – CAS registry number.
- Returns:
SMILES string or “NotFound” if not found.
- Return type:
str
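The lookup described above can be sketched in two parts: a local check of the CAS check digit, followed by construction of a PUG-REST request URL. This is a minimal sketch, not the implementation in `stpy.utils`. The helper names `is_valid_cas` and `pubchem_smiles_url` are hypothetical, and resolving the CAS number through PubChem's `name` namespace and requesting the `CanonicalSMILES` property are assumptions about how such a query might be made.

```python
import re
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_valid_cas(cas_number):
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_number)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body;
    # the check digit is the weighted sum modulo 10.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))


def pubchem_smiles_url(cas_number):
    """Build a PUG-REST URL that asks for a SMILES string as plain text.

    Assumption: the CAS number is looked up as a compound name, and the
    'CanonicalSMILES' property is requested. The real function may differ.
    """
    return f"{PUG_REST}/compound/name/{quote(cas_number)}/property/CanonicalSMILES/TXT"
```

Validating the check digit locally avoids sending a request for an obviously mistyped number. The default argument `'50-78-2'` passes this check.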
- stpy.utils.concat_fingerprints(smiles, fps=('morgan', 'rdkit'), radius=3, fpSize=2048, dtype='uint8', logger=None)
Generate and concatenate multiple fingerprints for a single molecule.
- Parameters:
smiles (str) – Input SMILES string.
fps (tuple of str) – Fingerprint types to concatenate.
radius (int) – Morgan radius (used where applicable).
fpSize (int) – Bit vector size for fingerprints that support it.
dtype (str) – Output dtype: ‘bool’, ‘uint8’, ‘int8’.
- Returns:
Concatenated fingerprint vector.
- Return type:
np.ndarray or None
- stpy.utils.concat_fingerprints_df(df, smiles_col='smiles', fps=('morgan', 'rdkit'), radius=3, fpSize=2048, dtype='uint8', logger=None)
Compute concatenated fingerprints for all SMILES in a DataFrame.
- Returns:
Shape: (n_samples, total_fp_length)
- Return type:
np.ndarray
- stpy.utils.fp_manipulate(df, fp1, fp2, mani='concat')
Concatenate or sum two fingerprint columns in a DataFrame.
- Parameters:
df (pd.DataFrame) – Input dataframe.
fp1 (str) – First fingerprint column name.
fp2 (str) – Second fingerprint column name.
mani (str) – ‘concat’ or ‘sum’.
- Returns:
DataFrame with new column ‘{mani}_fp’.
- Return type:
pd.DataFrame
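The two `mani` modes can be illustrated on a single pair of fingerprint vectors. The sketch below uses plain Python lists so that it runs without NumPy or pandas. `combine_fingerprints` is a hypothetical helper, not part of `stpy.utils`, and requiring equal lengths for ‘sum’ is an assumption.

```python
def combine_fingerprints(fp_a, fp_b, mani="concat"):
    """Combine two fingerprint vectors given as sequences of ints."""
    if mani == "concat":
        # Concatenation: the result length is len(fp_a) + len(fp_b).
        return list(fp_a) + list(fp_b)
    if mani == "sum":
        # Element-wise sum: assumed to require vectors of equal length.
        if len(fp_a) != len(fp_b):
            raise ValueError("'sum' requires fingerprints of equal length")
        return [a + b for a, b in zip(fp_a, fp_b)]
    raise ValueError(f"unknown mani: {mani!r}")
```

Applied row by row to two fingerprint columns, ‘concat’ keeps the bits of both fingerprints side by side, while ‘sum’ folds them into one vector whose entries may exceed 1.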
- stpy.utils.get_fingerprint(smiles, fp='morgan', radius=3, fpSize=2048, output='numpy', dtype='uint8', logger=None)
Memory‑optimized molecular fingerprint generator.
- Parameters:
smiles (str) – Input SMILES string.
fp (str) – Fingerprint type.
output (str) – ‘numpy’ (recommended) or ‘vect’.
dtype (str) – Data type for NumPy output: ‘bool’, ‘uint8’, or ‘int8’.
- Return type:
np.ndarray or ExplicitBitVect or None
- stpy.utils.get_fingerprints_df(df, smiles_col='smiles', fp='morgan', radius=3, fpSize=2048, dtype='uint8')
Compute fingerprints for all SMILES in a DataFrame. Returns a new column ‘{fp}_fp’ with fingerprint arrays.
- Parameters:
df (pd.DataFrame) – Input dataframe.
smiles_col (str) – Column name containing SMILES strings.
fp (str) – Fingerprint type.
radius (int) – Morgan radius (used where applicable).
fpSize (int) – Bit vector size for fingerprints that support it.
dtype (str) – Output dtype: ‘bool’, ‘uint8’, ‘int8’.
- Returns:
DataFrame with new column ‘{fp}_fp’.
- Return type:
pd.DataFrame
- stpy.utils.molecule_standardize(smiles, steps=None, largest_fragment=False, numThreads=4, logger=None)
Convenience wrapper around MoleculeStandardizer.
- stpy.utils.safe_canonicalsmi_from_smiles(smi)
Safely generate canonical SMILES from input SMILES string.
- Parameters:
smi (string) – SMILES string.
- Returns:
Canonical SMILES string.
- Return type:
string
Examples
>>> smiles = 'C1=CC=CC=C1OCOC'
>>> canon_smi = safe_canonicalsmi_from_smiles(smiles)
>>> print(canon_smi)
COCOc1ccccc1
>>> a = ['COCCCN', 'c1ccccc1OCOC', None, 'C1CCCCC1O', 'C1=CC=CC=C1', 'invalid_smiles']
>>> df = pd.DataFrame({'smiles': a})
>>> df['canonical_smi'] = df['smiles'].apply(safe_canonicalsmi_from_smiles)
>>> print(df)
           smiles  canonical_smi
0          COCCCN         COCCCN
1    c1ccccc1OCOC   COCOc1ccccc1
2            None           None
3       C1CCCCC1O      OC1CCCCC1
4     C1=CC=CC=C1       c1ccccc1
5  invalid_smiles           None
stpy.summarize_gScholarAlerts module
This script automates the summarizing of publications from Google Scholar Alerts.
It connects to a Gmail account, fetches all/unread emails from the “gScholarAlerts” label for the last 7 days, extracts publication details, and saves the summary to a CSV file. It also retrieves additional information from the DOI and URL of the publications. The script is designed to work with Google Scholar Alerts, which send notifications about new publications based on user-defined search queries.
The script uses the following libraries:
- imaplib: For connecting to the Gmail IMAP server and fetching emails.
- email: For parsing the email content.
- pandas: For data manipulation and saving to CSV.
- BeautifulSoup: For parsing HTML content.
- requests: For making HTTP requests to fetch publication details from DOI and URL.
- fitz (PyMuPDF): For extracting text from PDF files.
The script is designed to be run as a standalone program, and it requires the following environment variables to be set:
- GMAIL_USERNAME: The Gmail username (email address).
- GMAIL_PASSWORD: The Gmail password (or app password if 2FA is enabled).
It is recommended to use an app password for security reasons if 2FA is enabled on the Gmail account.
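Reading the two environment variables can be sketched as follows. This is an illustrative helper, not code from the module. The name `load_gmail_credentials` is hypothetical, and failing early with a `RuntimeError` is a design assumption.

```python
import os


def load_gmail_credentials(env=None):
    """Return (username, password) read from GMAIL_USERNAME and GMAIL_PASSWORD."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them at once.
    missing = [name for name in ("GMAIL_USERNAME", "GMAIL_PASSWORD")
               if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["GMAIL_USERNAME"], env["GMAIL_PASSWORD"]
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.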
- stpy.summarize_gScholarAlerts.extract_publication_details(html_content)
Extract publication details from the HTML content of the email.
- Parameters:
html_content (str) – The HTML content of the email.
- Returns:
A list of dictionaries containing publication details.
- Return type:
list
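A rough sketch of this kind of extraction is shown below. It uses the standard-library `html.parser` rather than BeautifulSoup so that it runs without extra dependencies. It assumes, without confirmation from the source, that each publication title in an alert email is a link inside an `<h3>` element. The names `AlertTitleParser` and `extract_titles` are hypothetical.

```python
from html.parser import HTMLParser


class AlertTitleParser(HTMLParser):
    """Collect the text and href of every <a> found inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.publications = []
        self._in_h3 = False
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._current = {"title": "", "url": dict(attrs).get("href") or ""}

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.publications.append(self._current)
            self._current = None
        elif tag == "h3":
            self._in_h3 = False


def extract_titles(html_content):
    """Return a list of {'title', 'url'} dictionaries, one per heading link."""
    parser = AlertTitleParser()
    parser.feed(html_content)
    parser.close()
    return parser.publications
```

The real function may capture further fields (authors, snippet, DOI). This sketch only shows the list-of-dictionaries shape of the return value.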
- stpy.summarize_gScholarAlerts.extract_publication_info(url)
Extract publication details based on whether the URL is a PDF or a web page.
- Parameters:
url (str) – The URL of the publication.
- Returns:
A dictionary containing publication details.
- Return type:
dict
- stpy.summarize_gScholarAlerts.extract_text_from_pdf(pdf_url)
Download a PDF from a URL and extract its text without saving it locally.
- Parameters:
pdf_url (str) – The URL of the PDF file.
- Returns:
Extracted text from the PDF.
- Return type:
str
- stpy.summarize_gScholarAlerts.extract_text_from_webpage(web_url)
Extract text from an HTML webpage.
- Parameters:
web_url (str) – The URL of the webpage.
- Returns:
A dictionary containing the abstract text.
- Return type:
dict
- stpy.summarize_gScholarAlerts.fetch_unread_emails(username, password, since_days, label='gScholarAlerts')
Fetch unread emails from Gmail that carry a specific label and were received within the given number of days.
- Parameters:
username (str) – The Gmail username (email address).
password (str) – The Gmail password (or app password if 2FA is enabled).
since_days (date) – The number of days to look back for emails.
label (str, optional) – The Gmail label to fetch emails from, as configured in your Gmail account. Defaults to “gScholarAlerts”.
- Returns:
Fetched mail object and email IDs.
- Return type:
mail, email_ids (set)
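The fetch described above can be approximated with `imaplib` as in the sketch below. It is not the module's implementation. `build_search_criteria` and `search_label` are hypothetical names, `since_days` is treated as an integer number of days (as the description suggests), and the server name `imap.gmail.com` is Gmail's standard IMAP host.

```python
import imaplib
from datetime import date, timedelta

# IMAP dates use English month abbreviations regardless of the system locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_search_criteria(since_days, unread_only=True, today=None):
    """Build an IMAP SEARCH string such as '(UNSEEN SINCE 01-Mar-2024)'."""
    today = today or date.today()
    since = today - timedelta(days=since_days)
    criteria = f"SINCE {since.day:02d}-{_MONTHS[since.month - 1]}-{since.year}"
    if unread_only:
        criteria = "UNSEEN " + criteria
    return f"({criteria})"


def search_label(username, password, since_days, label="gScholarAlerts"):
    """Log in, select the label as a mailbox, and return (mail, email_ids)."""
    mail = imaplib.IMAP4_SSL("imap.gmail.com")
    mail.login(username, password)
    mail.select(label)
    status, data = mail.search(None, build_search_criteria(since_days))
    email_ids = set(data[0].split()) if status == "OK" else set()
    return mail, email_ids
```

Keeping the criteria builder separate from the network call lets the date arithmetic be checked without a Gmail account.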
- stpy.summarize_gScholarAlerts.get_more_info(publications)
Get more information from DOI and URL.
- Parameters:
publications (list) – A list of dictionaries containing publication details.
- Returns:
A DataFrame with additional information from DOI and URL.
- Return type:
pd.DataFrame
- stpy.summarize_gScholarAlerts.get_publication_details(doi)
Retrieve publication details, including the abstract, from the CrossRef API using a DOI.
- Parameters:
doi (str) – The DOI of the publication.
- Returns:
A dictionary containing publication details.
- Return type:
dict
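As a hedged illustration, the sketch below builds a CrossRef works URL and flattens a response into a small dictionary. The helper names and the choice of returned keys are hypothetical. The field names (`message`, `title`, `author`, `abstract`, `DOI`) follow the CrossRef REST API's JSON layout, in which titles are lists and abstracts carry JATS XML tags.

```python
import re
from urllib.parse import quote


def crossref_url(doi):
    """Return the CrossRef REST API URL for a single work."""
    return "https://api.crossref.org/works/" + quote(doi)


def summarize_crossref_record(payload):
    """Reduce a decoded CrossRef JSON response to a flat dictionary."""
    message = payload.get("message", {})
    titles = message.get("title") or [""]
    # Abstracts arrive wrapped in JATS tags such as <jats:p>; strip the markup.
    abstract = re.sub(r"<[^>]+>", "", message.get("abstract", "")).strip()
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {
        "doi": message.get("DOI", ""),
        "title": titles[0],
        "authors": authors,
        "abstract": abstract,
    }
```

Not every CrossRef record includes an abstract, so the sketch falls back to an empty string rather than raising.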
- stpy.summarize_gScholarAlerts.is_pdf_url(url)
Check if the URL points to a PDF file by examining headers.
- Parameters:
url (str) – The URL to check.
- Returns:
True if the URL points to a PDF, False otherwise.
- Return type:
bool
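A header-based check of this kind might look like the sketch below. It uses the standard-library `urllib` instead of `requests` so that it stays dependency-free. The names `headers_indicate_pdf` and `is_pdf` are hypothetical, and relying solely on the `Content-Type` header is an assumption about what "examining headers" means here.

```python
import urllib.request


def headers_indicate_pdf(headers):
    """Return True if a header mapping declares the media type application/pdf."""
    content_type = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-type":
            content_type = value
            break
    # Drop parameters such as '; charset=binary' before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def is_pdf(url, timeout=10):
    """Send a HEAD request and inspect the response headers (network access required)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return headers_indicate_pdf(dict(response.headers.items()))
```

A HEAD request avoids downloading the file body. Some servers do not answer HEAD requests or report a generic content type, so a production version may need a fallback.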
- stpy.summarize_gScholarAlerts.parse_email(mail, email_id)
Parse fetched emails.
- Parameters:
mail (object) – Fetched mail object.
email_id (object) – Fetched email ID.
- Returns:
Parsed email content.
- Return type:
str
- stpy.summarize_gScholarAlerts.summarize_publications(username, password, since_days)
Summarize publications from Google Scholar Alerts and save to CSV file.
- Parameters:
username (str) – The Gmail username (email address).
password (str) – The Gmail password (or app password if 2FA is enabled).
since_days (date) – The number of days to look back for emails.
- Returns:
None