stpy package

Subpackages

Submodules

stpy.utils module

Helper functions for molecular fingerprinting, RDKit-based molecule standardization, SMILES canonicalization, and CAS-to-SMILES lookup.

class stpy.utils.FingerprintTransformer(fp='morgan', radius=3, fpSize=2048, dtype='uint8')

Bases: BaseEstimator, TransformerMixin

fit(X, y=None)
transform(X)
class stpy.utils.MoleculeStandardizer(steps=None, largest_fragment=False, numThreads=4, logger=None)

Bases: object

A reusable, configurable RDKit molecule standardization pipeline.

See RDKit's normalization transform catalog: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/MolStandardize/TransformCatalog/normalizations.in

Parameters:
  • steps (optional) – Standardization steps to apply. Supported steps:

    • “normalize”

    • “remove_fragments”

    • “reionize”

    • “tautomer_parent”

    • “stereo_parent”

    • “isotope_parent”

    • “charge_parent”

    • “super_parent”

    • “cleanup”

  • largest_fragment (bool, optional) – Keep only the largest fragment after processing.

  • numThreads (int, optional) – Number of threads for RDKit operations.

  • logger (logging.Logger, optional) – Logger for reporting actions.

Notes

This class is designed for batch processing and reproducible pipelines.

standardize_mols(mols)

Apply the configured standardization pipeline to RDKit Mol objects.

standardize_smiles(smiles_list)

Standardize a list of SMILES strings.

Returns:

Standardized SMILES strings.

Return type:

list of str

stpy.utils.cas2smiles(cas_number='50-78-2')

Convert a CAS number to SMILES using the PubChem PUG-REST API.

Parameters:

cas_number (str) – CAS registry number.

Returns:

SMILES string or “NotFound” if not found.

Return type:

str
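The lookup described above can be sketched with the standard library alone. This is an illustration, not the stpy implementation: it assumes PubChem's PUG-REST name lookup accepts a CAS number as a compound name, and the helper names `is_valid_cas` and `pubchem_smiles_url`, as well as the default property name, are hypothetical and may need adjusting to the current PubChem property list.

```python
import re
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_valid_cas(cas_number: str) -> bool:
    """Check the CAS layout (2-7 digits, 2 digits, 1 check digit) and checksum."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_number)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def pubchem_smiles_url(cas_number: str, prop: str = "CanonicalSMILES") -> str:
    """Build a PUG-REST URL that looks the CAS number up as a compound name.

    The property name is an assumption; pass a different one if needed.
    """
    return f"{PUG_REST}/compound/name/{quote(cas_number)}/property/{prop}/TXT"
```

Validating the check digit before sending a request avoids a network round trip for mistyped identifiers.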

stpy.utils.concat_fingerprints(smiles, fps=('morgan', 'rdkit'), radius=3, fpSize=2048, dtype='uint8', logger=None)

Generate and concatenate multiple fingerprints for a single molecule.

Parameters:
  • smiles (str) – Input SMILES string.

  • fps (tuple of str) – Fingerprint types to concatenate.

  • radius (int) – Morgan radius (used where applicable).

  • fpSize (int) – Bit vector size for fingerprints that support it.

  • dtype (str) – Output dtype: ‘bool’, ‘uint8’, ‘int8’.

Returns:

Concatenated fingerprint vector.

Return type:

np.ndarray or None

stpy.utils.concat_fingerprints_df(df, smiles_col='smiles', fps=('morgan', 'rdkit'), radius=3, fpSize=2048, dtype='uint8', logger=None)

Compute concatenated fingerprints for all SMILES in a DataFrame.

Returns:

Fingerprint matrix of shape (n_samples, total_fp_length).

Return type:

np.ndarray

stpy.utils.fp_manipulate(df, fp1, fp2, mani='concat')

Concatenate or sum two fingerprint columns in a DataFrame.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • fp1 (str) – First fingerprint column name.

  • fp2 (str) – Second fingerprint column name.

  • mani (str) – ‘concat’ or ‘sum’.

Returns:

DataFrame with new column ‘{mani}_fp’.

Return type:

pd.DataFrame
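The two modes can be illustrated on plain NumPy vectors. This is a sketch under assumptions, not the stpy code: it assumes ‘sum’ means an element-wise sum of equal-length vectors, and the function name `combine_fingerprints` is hypothetical.

```python
import numpy as np


def combine_fingerprints(fp_a, fp_b, mani="concat"):
    """Combine two fingerprint vectors by concatenation or element-wise sum."""
    a = np.asarray(fp_a, dtype=np.uint8)
    b = np.asarray(fp_b, dtype=np.uint8)
    if mani == "concat":
        # Result length is len(a) + len(b).
        return np.concatenate([a, b])
    if mani == "sum":
        # Assumed semantics: element-wise sum, so lengths must match.
        if a.shape != b.shape:
            raise ValueError("'sum' requires fingerprints of equal length")
        return a + b
    raise ValueError(f"unknown mani: {mani!r}")
```

Concatenation keeps the two fingerprints distinguishable at the cost of a longer vector, while the sum keeps the original length but merges bit positions.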

stpy.utils.get_fingerprint(smiles, fp='morgan', radius=3, fpSize=2048, output='numpy', dtype='uint8', logger=None)

Memory‑optimized molecular fingerprint generator.

Parameters:
  • smiles (str) – Input SMILES string.

  • fp (str) – Fingerprint type.

  • output (str) – ‘numpy’ (recommended) or ‘vect’.

  • dtype (str) – Data type for NumPy output: ‘bool’, ‘uint8’, or ‘int8’.

Return type:

np.ndarray or ExplicitBitVect or None

stpy.utils.get_fingerprints_df(df, smiles_col='smiles', fp='morgan', radius=3, fpSize=2048, dtype='uint8')

Compute fingerprints for all SMILES in a DataFrame and store the fingerprint arrays in a new column ‘{fp}_fp’.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • smiles_col (str) – Column name containing SMILES strings.

  • fp (str) – Fingerprint type.

  • radius (int) – Morgan radius (used where applicable).

  • fpSize (int) – Bit vector size for fingerprints that support it.

  • dtype (str) – Output dtype: ‘bool’, ‘uint8’, ‘int8’.

Returns:

DataFrame with new column ‘{fp}_fp’.

Return type:

pd.DataFrame

stpy.utils.molecule_standardize(smiles, steps=None, largest_fragment=False, numThreads=4, logger=None)

Convenience wrapper around MoleculeStandardizer.

stpy.utils.safe_canonicalsmi_from_smiles(smi)

Safely generate canonical SMILES from input SMILES string.

Parameters:

smi (string) – SMILES string.

Returns:

Canonical SMILES string.

Return type:

string

Examples

>>> import pandas as pd
>>> from stpy.utils import safe_canonicalsmi_from_smiles
>>> smiles = 'C1=CC=CC=C1OCOC'
>>> canon_smi = safe_canonicalsmi_from_smiles(smiles)
>>> print(canon_smi)
COCOc1ccccc1
>>> a = ['COCCCN', 'c1ccccc1OCOC', None, 'C1CCCCC1O', 'C1=CC=CC=C1', 'invalid_smiles']
>>> df = pd.DataFrame({'smiles': a})
>>> df['canonical_smi'] = df['smiles'].apply(safe_canonicalsmi_from_smiles)
>>> print(df)
           smiles  canonical_smi
0          COCCCN         COCCCN
1    c1ccccc1OCOC   COCOc1ccccc1
2            None           None
3       C1CCCCC1O      OC1CCCCC1
4     C1=CC=CC=C1       c1ccccc1
5  invalid_smiles           None

stpy.summarize_gScholarAlerts module

This script automates the process of summarizing publications from Google Scholar Alerts.

It connects to a Gmail account, fetches all/unread emails from the “gScholarAlerts” label for the last 7 days, extracts publication details, and saves the summary to a CSV file. It also retrieves additional information from the DOI and URL of the publications. The script is designed to work with Google Scholar Alerts, which send notifications about new publications based on user-defined search queries.

The script uses the following libraries:

  • imaplib: For connecting to the Gmail IMAP server and fetching emails.

  • email: For parsing the email content.

  • pandas: For data manipulation and saving to CSV.

  • BeautifulSoup: For parsing HTML content.

  • requests: For making HTTP requests to fetch publication details from DOI and URL.

  • fitz (PyMuPDF): For extracting text from PDF files.

The script is designed to be run as a standalone program, and it requires the following environment variables to be set:

  • GMAIL_USERNAME: The Gmail username (email address).

  • GMAIL_PASSWORD: The Gmail password (or app password if 2FA is enabled).

It is recommended to use an app password for security reasons if 2FA is enabled on the Gmail account.
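Reading the two environment variables can be sketched as follows. This is an illustration rather than the module's own code; the function name `load_gmail_credentials` and its error handling are assumptions.

```python
import os


def load_gmail_credentials(environ=None):
    """Return (username, password) from GMAIL_USERNAME and GMAIL_PASSWORD."""
    environ = os.environ if environ is None else environ
    names = ("GMAIL_USERNAME", "GMAIL_PASSWORD")
    # Fail early with a clear message instead of attempting a login with None.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ["GMAIL_USERNAME"], environ["GMAIL_PASSWORD"]
```

Keeping credentials in the environment rather than in the script avoids committing them to version control.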

stpy.summarize_gScholarAlerts.extract_publication_details(html_content)

Extract publication details from the HTML content of the email.

Parameters:

html_content (str) – The HTML content of the email.

Returns:

A list of dictionaries containing publication details.

Return type:

list
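A dependency-free sketch of this kind of extraction is shown below. The module itself uses BeautifulSoup; this version swaps in the standard library's `html.parser` instead. It assumes each publication title appears as a link inside an `<h3>` element, which may not match the actual alert markup, and the names `AlertTitleParser` and `extract_titles` are hypothetical.

```python
from html.parser import HTMLParser


class AlertTitleParser(HTMLParser):
    """Collect the text and href of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.publications = []
        self._in_h3 = False
        self._current = None  # entry being built while inside <h3><a>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._current = {"title": "", "url": dict(attrs).get("href", "")}

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.publications.append(self._current)
            self._current = None
        elif tag == "h3":
            self._in_h3 = False


def extract_titles(html_content):
    """Return a list of {'title', 'url'} dictionaries found in the HTML."""
    parser = AlertTitleParser()
    parser.feed(html_content)
    return parser.publications
```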

stpy.summarize_gScholarAlerts.extract_publication_info(url)

Extract publication details based on whether the URL is a PDF or a web page.

Parameters:

url (str) – The URL of the publication.

Returns:

A dictionary containing publication details.

Return type:

dict

stpy.summarize_gScholarAlerts.extract_text_from_pdf(pdf_url)

Download a PDF from a URL and extract its text without saving it locally.

Parameters:

pdf_url (str) – The URL of the PDF file.

Returns:

Extracted text from the PDF.

Return type:

str

stpy.summarize_gScholarAlerts.extract_text_from_webpage(web_url)

Extract text from an HTML webpage.

Parameters:

web_url (str) – The URL of the webpage.

Returns:

A dictionary containing the abstract text.

Return type:

dict

stpy.summarize_gScholarAlerts.fetch_unread_emails(username, password, since_days, label='gScholarAlerts')

Fetch unread emails from Gmail that carry a specific label and were received within the last since_days days.

Parameters:
  • username (str) – The Gmail username (email address).

  • password (str) – The Gmail password (or app password if 2FA is enabled).

  • since_days (date) – The number of days to look back for emails.

  • label (str, optional) – The Gmail label to fetch emails from, as configured in your Gmail account. Defaults to “gScholarAlerts”.

Returns:

Fetched mail object and email IDs.

Return type:

mail, email_ids (set)
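One way to express this fetch with `imaplib` is sketched below. It is not the module's implementation: the helper names, the `unread_only` flag, and the assumption that `since_days` is an integer number of days are all illustrative. IMAP expects dates in `DD-Mon-YYYY` form, so the month names are spelled out explicitly to avoid locale-dependent formatting.

```python
import imaplib
from datetime import date, timedelta

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_search_criteria(since_days, today=None, unread_only=True):
    """Return IMAP SEARCH tokens for messages from the last `since_days` days."""
    today = today or date.today()
    since = today - timedelta(days=since_days)
    since_str = f"{since.day:02d}-{_MONTHS[since.month - 1]}-{since.year}"
    criteria = ["SINCE", since_str]
    if unread_only:
        criteria.insert(0, "UNSEEN")
    return criteria


def fetch_email_ids(username, password, since_days, label="gScholarAlerts"):
    """Log in to Gmail over IMAP and return (mail, email_ids) for the label."""
    mail = imaplib.IMAP4_SSL("imap.gmail.com")
    mail.login(username, password)
    mail.select(label)
    status, data = mail.search(None, *build_search_criteria(since_days))
    email_ids = set(data[0].split()) if status == "OK" else set()
    return mail, email_ids
```

Separating the criteria builder from the network call keeps the date logic testable without a Gmail account.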

stpy.summarize_gScholarAlerts.get_more_info(publications)

Get more information from DOI and URL.

Parameters:

publications (list) – A list of dictionaries containing publication details.

Returns:

A DataFrame with additional information from DOI and URL.

Return type:

pd.DataFrame

stpy.summarize_gScholarAlerts.get_publication_details(doi)

Retrieve publication details, including the abstract, using the DOI from the CrossRef API.

Parameters:

doi (str) – The DOI of the publication.

Returns:

A dictionary containing publication details.

Return type:

dict
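The CrossRef REST API serves work metadata at `https://api.crossref.org/works/{doi}` and wraps the record in a `message` object. The sketch below shows how such a record could be reduced to a small dictionary. It is an assumption about the shape of the result, not the module's code; the function names and the returned keys are hypothetical, and abstracts are only present when the publisher deposited them.

```python
import re
from urllib.parse import quote


def crossref_url(doi):
    """Build the CrossRef works URL for a DOI."""
    return "https://api.crossref.org/works/" + quote(doi)


def parse_crossref_message(message):
    """Reduce a CrossRef `message` object to title, authors, and abstract."""
    # CrossRef returns the title as a list of strings.
    title = (message.get("title") or [""])[0]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    # Abstracts arrive as JATS XML; strip the tags to keep plain text.
    abstract = re.sub(r"<[^>]+>", "", message.get("abstract", "")).strip()
    return {"title": title, "authors": authors, "abstract": abstract}
```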

stpy.summarize_gScholarAlerts.is_pdf_url(url)

Check if the URL points to a PDF file by examining headers.

Parameters:

url (str) – The URL to check.

Returns:

True if the URL points to a PDF, False otherwise.

Return type:

bool

stpy.summarize_gScholarAlerts.parse_email(mail, email_id)

Parse fetched emails.

Parameters:
  • mail (object) – Fetched mail object.

  • email_id (object) – Fetched email ID.

Returns:

Parsed email content.

Return type:

str
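Parsing a fetched message with the standard `email` package might look like the sketch below. It assumes the alert body is carried in a `text/html` part and that the message is fetched with the `(RFC822)` item; the function names are hypothetical and this is not the module's implementation.

```python
import email


def html_from_raw_email(raw_bytes):
    """Return the first text/html part of a raw RFC 822 message, or ''."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def parse_email_sketch(mail, email_id):
    """Fetch one message from an imaplib connection and return its HTML body."""
    status, data = mail.fetch(email_id, "(RFC822)")
    if status != "OK" or not data or data[0] is None:
        return ""
    # data[0] is a (header, raw message bytes) tuple.
    return html_from_raw_email(data[0][1])
```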

stpy.summarize_gScholarAlerts.summarize_publications(username, password, since_days)

Summarize publications from Google Scholar Alerts and save to CSV file.

Parameters:
  • username (str) – The Gmail username (email address).

  • password (str) – The Gmail password (or app password if 2FA is enabled).

  • since_days (date) – The number of days to look back for emails.

Returns:

None

Module contents