# A hierarchically porous MOF-confined CsPbBr$_3$ quantum dots: fluorescence switching probe for detecting Cu(II) and melamine in food samples
Shahnaz Ahmed a, Suman Lahkar a, Simanta Doley b, Dambarudhar Mohanta c, Swapan Kumar Dolui a,*
a Department of Chemical Sciences, Tezpur University, Napaam, Tezpur, Assam 784028, India
b Jengraimukh College, Majuli, Assam 784028, India
c Department of Physics, Tezpur University, Napaam, Tezpur, Assam 784028, India
# ARTICLE INFO
# ABSTRACT
Keywords: CsPbBr$_3$/HZIF-8; Metal organic framework; Stability; Cu(II); Melamine; Fluorescence sensor
The hypersensitivity of perovskite quantum dots (PeQDs) towards external environmental conditions and the water-induced quenching of their fluorescence have limited their practical analytical applications. In this work, highly luminescent CsPbBr$_3$ was confined within a hierarchically porous ZIF-8 metal organic framework (HZIF-8) through a simple two-step in situ growth method. XRD, FTIR, SEM, TEM, and XPS investigations confirmed the successful synthesis of the CsPbBr$_3$/HZIF-8 nanocomposite. The CsPbBr$_3$ PeQDs are uniformly distributed within the HZIF-8 MOF matrix and exhibit intense green emission at 510 nm with a FWHM of 25 nm under ambient conditions. The nanocomposite showed enhanced stability against moisture and UV light. The intense PL emission was also well maintained in aqueous solution, with no noticeable change in fluorescence characteristics over 15 days of storage. The CsPbBr$_3$/HZIF-8 composite was then utilized as an on–off–on luminescent probe for the detection of Cu$^{2+}$ ions and melamine in real samples. Because of the porous nature of the MOF, the target analytes can quickly reach the embedded PeQDs, and this nanosensor is found to be very sensitive toward Cu$^{2+}$ and melamine detection. Under optimal experimental conditions, linear responses from 3 to 500 nM and 30–1500 nM were found for melamine and Cu$^{2+}$ respectively, with limits of detection of around 4.66 nM for Cu$^{2+}$ and 2.64 nM for melamine.
# 1. Introduction
With the continued growth of the world population, concerns over food safety and environmental pollution are expected to rise, quickly becoming a global issue that earns widespread attention. Heavy metal ion pollutants readily accumulate in the human body; they are not only a hazard to the environment but also constitute a serious health risk to humans. In particular, the Cu$^{2+}$ ion, which is involved in many enzyme activities, has a considerable impact on both human health and plants. Excessive intake of Cu$^{2+}$ disrupts homeostasis, leading to a number of risks including Wilson's and Alzheimer's disease, among others. Industrial wastewater also contains copper as heavy metal ion waste. Therefore, it is important to quantify Cu$^{2+}$ ions in environmental and biological systems using simple and sensitive methods [1-3].
2,4,6-Triamino-1,3,5-triazine (melamine) is a nitrogen-rich organic base and an industrial raw material, generally used in the production of melamine–formaldehyde resins for the plastic and paint industries, adhesives, coatings, and fire-retardant materials. However, melamine has received major attention in food safety following the crisis concerning melamine-contaminated milk in 2008 [4]. Owing to its high nitrogen content, melamine is illegally added to dairy products and feedstuffs to inflate the apparent protein content, thus misleading consumers. Cyanuric acid, the hydrolysis product of melamine, forms a poorly soluble melamine cyanurate complex that eventually precipitates in the renal tubules, causing tissue injury and urinary calculi [5]. Consuming melamine-containing food over an extended period at levels above the safety limit (20 μM in the USA and EU, 8 μM for infant formula in China) can be fatal, especially for infants and children [6]. Therefore, it is crucial to develop an efficient technique for detecting traces of melamine in the food and feed industries to prevent adulteration and protect food safety. Advanced analytical techniques such as high-performance liquid chromatography (HPLC) [7], liquid chromatography-mass spectrometry (LC-MS) [8], electrochemical methods [9], colorimetry [10], surface-enhanced Raman spectroscopy [11], enzyme-linked immunosorbent assays [12], Fourier transform infrared spectroscopy (FT-IR), and fluorescence spectroscopy have been used to develop potential assays for melamine detection [4]. Among them, the fluorimetric method provides a simple, low-cost, highly selective, and sensitive route for melamine detection. Smart fluorescent materials that alter their physical parameters when exposed to an external stimulus have caught the attention of researchers due to their potential applications in a variety of disciplines such as sensing, anti-counterfeiting, bioimaging, etc. [13-17]. Most luminescent sensing applications are based on organic fluorophores and dyes, polymer nanocomposites, metal oxide nanoparticles, metal chalcogenide quantum dots, etc. [18-21]. However, these materials possess limitations including weak PL intensity, low photoluminescence quantum yield (PLQY), poor color purity, photobleaching, and broad emission lines that limit detection sensitivity. Thus, the introduction of a new fluorescent nanomaterial with enhanced luminescent performance remains essential.
With the emergence of nanotechnology, metal halide perovskite nanocrystals (PeNCs) with very high PLQY, narrow and intense emission, emission and absorption tunable over the entire visible spectrum, and long carrier lifetimes have attracted massive attention, making them promising materials in lighting and display technology, photovoltaic solar cells, photodetectors, and sensors, among many others [22-24]. Inorganic CsPbX$_3$ (X = Br, Cl, I) PeNCs with the conventional perovskite architecture have recently demonstrated great potential as chemical sensor materials [25-28] owing to their strong fluorescence characteristics. However, most analytical applications of CsPbX$_3$ PeNCs are limited to non-aqueous media, hindering practical applications. The relatively low lattice formation energy and the ionic nature of perovskite crystals result in poor stability, leading to a loss of PL intensity under extreme environmental conditions (light, humidity, water, and high temperature). This concern has sparked a great deal of research interest in CsPbX$_3$ PeNC stabilization. Numerous efforts have been made to stabilize perovskite quantum dots (PeQDs) either through surface passivation or through encapsulation in an inorganic matrix (SiO$_2$, zeolite, Al$_2$O$_3$, TiO$_2$) [29-32]. Among these conventional inorganic materials, metal organic frameworks (MOFs) have earned significant attention due to their high porosity, functional diversity, tunable pore size, structural adjustability, and very high specific surface area [33]. Encapsulating CsPbX$_3$ PeNCs in MOF matrices to create host–guest composites has been shown to be an easy and effective approach [34-38]. Additionally, the strong host–guest interactions and the MOFs' substantial surface area stabilize the nanoconfined guest species without the need for additional stabilizing agents.

The pore diameters of frequently employed MOF hosts lie in the microporous region, which is unfavorable for reactant access due to substantial diffusion resistance and results in limited loading of PeNCs [34,38]. In recent years, MOFs have been increasingly utilized as chemical sensors [39,40], but their low PLQY, originating either from the poor intrinsic luminescence of the inorganic or organic components or from weak metal-to-ligand charge transfer, limits their application as sensors [41]. Furthermore, improving the luminescence of MOFs typically requires introducing lanthanide metal ions along with complex organic linkers, which are synthetically challenging. Therefore, exploiting the ultrahigh porosity of MOFs as host matrices for luminescent guest molecules with high PLQY has been pursued as an alternative route to functionalized luminescent MOF-based nanocomposites [42,43].
Zeolitic imidazolate framework-8 (ZIF-8), with zinc as the metal source and 2-methylimidazole as the organic linker, is a subclass of microporous MOF with superior chemical stability. ZIF-8 thus exhibits pore sizes in the microporous region, i.e. < 2 nm. As mentioned earlier, a microporous system limits the diffusion of guest molecules; hierarchically porous ZIF-8 (HZIF-8) is therefore of great interest. In HZIF-8, mesopores are introduced into ZIF-8, yielding a mixture of micropores and mesopores that increases the diffusion rate of reactants. To increase the material's stability, CsPbX$_3$ QDs can be enclosed in HZIF-8.
In an effort to widen the analytical applications of PeNCs, our present work introduces a confined synthesis of CsPbBr$_3$ PeNCs within the hierarchically porous MOF (HZIF-8) via an in situ growth method. The method uses HZIF-8 as a support to grow PeQDs directly within the MOF system with uniform crystal size and highly luminescent properties. HZIF-8 was synthesized following a triethylamine-assisted method. The stability of PeNCs within the MOF is greatly improved, and the bright PL is maintained under harsh environmental conditions (moisture, water, temperature, light, etc.). Functionalization of CsPbBr$_3$ quantum dots (QDs) with the HZIF-8 MOF host combines the unique and outstanding luminescence properties of perovskite nanocrystals with the efficient accumulation and adsorption of target analytes by the MOF matrix, enabling an effective sensing probe with great selectivity and sensitivity. The CsPbBr$_3$/HZIF-8 composites were therefore utilized as fluorescence turn-off–on sensors for the quantitative detection of copper ions and melamine in food samples, with very low limits of detection (LOD) of 4.66 nM and 13.84 nM respectively. Here, the metal ion (Cu$^{2+}$) acts as a quenching agent for the green PL signal of the CsPbBr$_3$/HZIF-8 composite. Melamine, as a multifunctional system, can competitively adsorb Cu$^{2+}$ from the surface of the sensing probe due to the strong interaction between melamine nitrogen and Cu, restoring the quenched PL signal and thus establishing an on–off–on fluorescence sensor.
# 2. Experimental section
# 2.1. Chemicals
Lead bromide (PbBr$_2$, 99.9%, Alfa Aesar), cesium bromide (CsBr, 99.9%, Alfa Aesar), oleic acid (OA, 90%, Alfa Aesar), oleylamine (OAm, 99%, Alfa Aesar), zinc nitrate hexahydrate (Zn(NO$_3$)$_2$·6H$_2$O, 99%, Alfa Aesar), 2-methylimidazole (HmIM, 99%, SRL Chemicals), methanol (MeOH, 99.5%, SRL Chemicals), N,N-dimethylformamide (DMF, 99%, Merck), toluene (C$_7$H$_8$, Merck), triethylamine (TEA), and melamine (C$_3$H$_6$N$_6$, 99%, Alfa Aesar) were used. Ultrapure water was used for all purposes. All chemicals were purchased from commercial sources and used as received.
# 2.2. Synthesis procedure
# 2.2.1. Synthesis of HZIF-8:
HZIF-8 was synthesized according to a previously reported method with minor modifications [44]. In this process, 0.8 ml of Zn(NO$_3$)$_2$·6H$_2$O solution (0.8 mmol) in deionized water was first mixed with 0.10 ml (0.70 mmol) of TEA, and then 2.3 ml of the HmIM solution (6.4 mmol) was added. The final molar ratio of metal to linker was 1:8. The reaction volume was made up to a total of 28 ml with deionized water. After continuous stirring for 30 min at room temperature, a white precipitate formed, which was collected by centrifugation. The obtained product was washed several times with a water/ethanol mixture and dried overnight at 70 °C in a vacuum oven.
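As a quick arithmetic check on the recipe above, the reagent masses implied by the stated millimole amounts can be computed from standard molar masses. This is a minimal sketch; the molar mass values are standard ones and the `mass_mg` helper is our own, not part of the original procedure:

```python
# Standard molar masses in g/mol (not from the original paper)
MW = {
    "Zn(NO3)2.6H2O": 297.49,   # zinc nitrate hexahydrate
    "HmIM": 82.10,             # 2-methylimidazole
}

def mass_mg(mmol, compound):
    """mass (mg) = amount (mmol) * molar mass (g/mol)."""
    return mmol * MW[compound]

zn_mmol, hmim_mmol = 0.8, 6.4  # amounts stated in the procedure
print(f"Zn salt: {mass_mg(zn_mmol, 'Zn(NO3)2.6H2O'):.0f} mg")
print(f"HmIM:    {mass_mg(hmim_mmol, 'HmIM'):.0f} mg")
print(f"metal : linker = 1 : {hmim_mmol / zn_mmol:.0f}")
```

The 6.4/0.8 mmol amounts reproduce the stated 1:8 metal-to-linker ratio.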
# 2.2.2. CsPbBr$_3$/HZIF-8 composite:
In a typical synthesis of the CsPbBr$_3$/HZIF-8 composite, the HZIF-8 MOF (150 mg) was first dispersed in 10 ml DMF. Then 2 mmol of PbBr$_2$ was added into the dispersion with continuous stirring for 5 h. The PbBr$_2$@MOF powder was collected by filtering the solution. In the second step, the resulting powder was dispersed in toluene (10 ml) with stirring. The CsBr/methanol solution (1.0 mmol) was then quickly injected into the toluene dispersion, inducing the crystallization of perovskite quantum dots and finally producing the perovskite@MOF composite. The whole experiment was performed at room temperature in an ambient atmosphere. The yellowish precipitate was collected by filtration and washed thoroughly with methanol.
# 2.2.3. Sample preparations and fluorescence measurements

This part is included in the supporting information file.
# 2.3. Instrumentation
Powder X-ray diffraction (PXRD) patterns were recorded on D8 Focus and MINIFLEX (Bruker AXS, Germany) instruments operated at 40 kV and 40 mA using Cu Kα radiation (λ = 1.5406 Å). Fourier transform infrared (FTIR) spectra of the samples were recorded using a Nicolet Impact-410 IR spectrometer. UV–Vis absorption and photoluminescence spectra of the samples were measured at room temperature using a Shimadzu UV-2450 and a Hitachi F-2700 fluorescence spectrophotometer. The relative PLQY of the composite was measured using fluorescein as the reference standard (quantum yield = 0.95 in 0.1 M NaOH). Time-resolved PL decay (TRPL) measurements were performed on a Lifespec II picosecond time-resolved fluorimeter. The surface morphologies of the samples were investigated using scanning electron microscopy (SEM, Jeol JSM 6390LV) equipped with energy-dispersive X-ray spectroscopy (EDX). X-ray photoelectron spectroscopy (XPS) was conducted on an ESCALAB 220 XL spectrometer. A TECNAI G2 20 S-TWIN (200 kV) was used to capture the transmission electron microscopy (TEM) and high-resolution transmission electron microscopy (HRTEM) images of the CsPbBr$_3$/HZIF-8 hybrid composite. A Quantachrome instrument (version 5.21) was used to obtain the N$_2$ adsorption–desorption isotherm and pore size distribution at 77 K.
# 3. Results and discussion
The hierarchically porous MOF HZIF-8 was synthesized by a simple template-free triethylamine-assisted method, in which free mesopores are generated in ZIF-8 without the need to remove a template. The as-synthesized HZIF-8 was used to embed CsPbBr$_3$ (CPB) PeNCs by a surfactant-free two-step approach. The CsPbBr$_3$/HZIF-8 nanohybrids were developed, as illustrated in Scheme 1, to create a high-performance sensing probe. All synthesized materials were characterized with various analytical tools.
# 3.1. Structural and morphological description of MOF and the composite material
The highly crystalline structure of the synthesized ZIF-8 and HZIF-8 MOF powders was identified using powder X-ray diffraction (XRD), shown in Fig. 1d. The diffraction peaks were located at around 2θ = 7.60°, 10.57°, 12.85°, 14.8°, 16.6°, 18.12°, 19.70°, 24.6°, 25.75°, 26.78°, 29.74°, 30.66° and 32.36°, corresponding to the (011), (002), (112), (022), (013), (222), (114), (233), (224), (134), (044), (244) and (235) planes of the ZIF-8 MOF [45]. The distinctive peaks of the PXRD pattern of HZIF-8 are in good agreement with those of the ZIF-8 pattern.
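The peak assignments above can be cross-checked by converting the reported 2θ positions into interplanar d-spacings via Bragg's law, using the Cu Kα wavelength quoted in Section 2.3 (λ = 1.5406 Å). A minimal sketch (the function name is ours; only a few of the listed peaks are shown):

```python
import math

CU_K_ALPHA = 1.5406  # X-ray wavelength in angstroms (Cu K-alpha)

def d_spacing(two_theta_deg, wavelength=CU_K_ALPHA):
    """Bragg's law: d = lambda / (2 sin(theta)), with theta = 2theta / 2."""
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))

# A few of the 2theta peaks (degrees) reported for ZIF-8 / HZIF-8
for peak in (7.60, 10.57, 12.85, 18.12):
    print(f"2theta = {peak:5.2f} deg -> d = {d_spacing(peak):5.2f} A")
```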
The FTIR spectra presented in Fig. S1b for HZIF-8 and CPB@HZIF-8 display bands at around 3190 cm⁻¹ and 2933 cm⁻¹, assigned to the C–H stretching vibrations of the methyl group of ZIF-8. The peaks in the region of 500–1600 cm⁻¹ result from the stretching and bending vibrations of the imidazolate ring in ZIF-8 [44]. The C–N absorption band of HZIF-8 shows strong peaks in the region 1100–1400 cm⁻¹. The characteristic Zn–N vibration was assigned a peak at 421 cm⁻¹, observed for both HZIF-8 and CPB@HZIF-8, indicating the coordination of Zn$^{2+}$ with the 2-methylimidazole ligand. Since the MOF integrated with PeQDs was synthesized without a surfactant, the organic components in ZIF-8 were primarily responsible for the vibrational peaks of CPB@HZIF-8. The FTIR spectrum of CPB@HZIF-8 preserves the specific vibrations of the MOF, indicating that the MOF matrix successfully passivated the PeQDs.
The occupancy of CsPbBr$_3$ in the MOF matrix can be further confirmed by XPS and EDX analysis. EDX spectra of the composite show the presence of Cs, Pb, and Br signals along with Zn and N, demonstrating the successful formation of CsPbBr$_3$ in the HZIF-8 MOF matrix (Fig. S2). XPS analysis was carried out to determine the surface characteristics and chemical states of the CsPbBr$_3$/HZIF-8 MOF composite. As illustrated in the survey XPS spectrum (Fig. 2a), the presence of the expected Cs, Pb, and Br signals on the surface of the PeQD@MOF composite, coupled with the C, N, and Zn$^{2+}$ signals from the HZIF-8 matrix, clearly validates the formation of CsPbBr$_3$ nanocrystals in the MOF matrix. The XPS fine spectra display feature peaks of Cs 3d$_{5/2}$ (724.2 eV) and Cs 3d$_{3/2}$ (738.2 eV), Pb 4f$_{7/2}$ (138.1 eV) and Pb 4f$_{5/2}$ (143 eV), and Br 3d$_{5/2}$ (67.9 eV) and Br 3d$_{3/2}$ (69.1 eV). These values closely match previous literature reports for CsPbBr$_3$ PeQDs [46]. It should be noted that the binding energies of Cs, Pb, and Br are not significantly affected by the insertion of the CsPbBr$_3$ PeQDs into the MOF. Aside from the PeQD signals, Zn 2p peaks appeared at binding energies of 1022 eV and 1045.1 eV (Fig. 2e), derived from the Zn 2p$_{3/2}$ and Zn 2p$_{1/2}$ states of HZIF-8. Fig. S1a shows the high-resolution XPS spectrum of C 1s, which can be deconvoluted into two peaks at 284.6 eV (C–H/C=C) and 285.2 eV (C–N), both from the imidazole ring. Similarly, the N 1s spectrum could be fitted into four peaks centered at 398.6 eV, 399.6 eV, 400.3 eV, and 403 eV. The peak at 398.6 eV corresponds to the N–Zn coordinate bond, while the peaks at 399.6 eV and 400.3 eV are mainly attributed to N–C and N–H moieties (Fig. 2f). The higher-energy N 1s peak at 403 eV is assigned to quaternary nitrogen, which indicates N–Pb interaction through non-coordinated nitrogen of the 2-methylimidazole linker, signifying a close interaction between the PeQDs and the MOF matrix. The XPS results strongly support the formation of the CsPbBr$_3$/HZIF-8 MOF composite.

Scheme 1. Schematic illustration of the development of the CsPbBr$_3$/HZIF-8 MOF nanocomposite and its sensing mechanism.

Fig. 1. TEM images of the CsPbBr$_3$/HZIF-8 MOF composite (a, b), HRTEM images of an enlarged view of the core of the composite/selected zone (c, d), XRD patterns of the MOF and the PeQD/MOF composite (e), and SEM image of HZIF-8, inset: close-up view of the MOF (f).

Fig. 2. XPS survey spectrum of the CsPbBr$_3$/HZIF-8 composite (a); XPS fine spectra of Cs 3d (b), Pb 4f (c), Br 3d (d), Zn 2p (e), and N 1s (f).
To better understand the micro-morphology and structural evolution of the CsPbBr$_3$/HZIF-8 composite throughout its formation, SEM and TEM micrographs were recorded. The SEM images of pristine HZIF-8 show a monodisperse bouquet-like morphology with a densely packed surface and an average size of about 70 nm (Fig. 1f). The CsPbBr$_3$/HZIF-8 MOF composite maintains the MOF's morphology, as revealed by the TEM images (Fig. 1a-d). The hierarchical porous characteristics of HZIF-8, encompassing both meso- and microporosity, are confirmed by the N$_2$ adsorption–desorption isotherm with a prominent hysteresis loop. The pore size distribution of HZIF-8 in Fig. 3d and Table S1 illustrates a mixture of type I and type IV adsorption–desorption behavior in HZIF-8. The BET (Brunauer–Emmett–Teller) surface area of HZIF-8 is calculated to be 2065.237 m² g⁻¹. When the nonporous PeQDs grow in situ into the pore channels of HZIF-8, the pore volume and BET surface area of the perovskite-confined MOF decrease significantly, attributed to the successful incorporation of the PeQDs into the MOF matrix. Further, TEM images of CsPbBr$_3$/HZIF-8 show well-defined CsPbBr$_3$ PeNCs embedded in the porous MOF matrix, forming a core–shell type structure. A large number of CsPbBr$_3$ quantum dots (QDs) can be seen as dark circular areas, with no visible particles outside the HZIF-8 MOF matrix. The high-resolution transmission electron microscopy (HRTEM) images of the confined CsPbBr$_3$ PeNCs in the HZIF-8 MOF, with clearly defined lattice spacings, are shown in Fig. 1c and 1d, demonstrating their high crystallinity. The interplanar spacings (d-spacings) calculated from the lattice fringes are 0.288 nm and 0.238 nm, corresponding to the (200) and (211) planes of cubic CsPbBr$_3$ respectively [35]. Based on these results, the CsPbX$_3$ QDs are safeguarded by the protective MOF shell, which should lead to improved stability of the perovskite–MOF binary composite.
Additionally, the identical diffraction peaks of CsPbBr$_3$/HZIF-8 imply that the growth of CsPbBr$_3$ PeQDs inside HZIF-8 does not compromise its crystalline integrity. Because of the small size of CsPbBr$_3$ and the very high crystallinity of the HZIF-8 MOF in comparison, the peak intensities associated with CsPbBr$_3$ could be screened by the diffraction peaks of the MOF matrix; hence the XRD pattern does not exhibit any prominent CsPbBr$_3$ peaks. From the TEM images of the CsPbBr$_3$/HZIF-8 composite, the average grain size of the PeQDs was found to be about 4 nm (< 5 nm), much smaller than that of the MOF host. The relative intensities of the diffraction peaks diminished and broadened, likely due to changes in electron density within the matrix upon loading of CsPbBr$_3$ into the pores of the MOF. Therefore, the framework structure of the MOF is preserved in all prepared samples. Similar results have been reported in other literature [34,35].
# 3.2. Photophysical properties:
Furthermore, we explored the optical properties of all the synthesized materials. Fig. 3b displays the UV–Vis spectra of HZIF-8 and CPB/HZIF-8, where the MOF exhibits only weak absorption. In the presence of PeQDs, the absorption of HZIF-8 increases, with an absorption peak at 509 nm. The PeQDs without the MOF matrix show similar absorption behavior to the CPB/HZIF-8 composite, with the absorption onset at 512 nm. The composite shows an intense green emission centered at 510 nm with a FWHM of 25 nm upon excitation at 365 nm. The narrow FWHM demonstrates both the color purity and the uniformity of the composite in terms of size and defects. A small blue shift of 12 nm and a slightly wider FWHM relative to CsPbBr$_3$ without the HZIF-8 matrix indicate a confinement effect arising from size restriction of the PeQDs by the MOF matrix. The PLQY of the CPB/HZIF-8 composite in toluene was calculated to be 45.5% using fluorescein as the reference standard.
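The relative PLQY determination against a fluorescein standard follows the usual single-reference comparison. The sketch below uses the standard formula; the intensities, absorbances, and refractive indices are illustrative placeholders (toluene for the sample, 0.1 M NaOH for the reference), not measured values from this work:

```python
def relative_plqy(i_sample, a_sample, n_sample,
                  i_ref, a_ref, n_ref, qy_ref=0.95):
    """Single-point relative method:
    QY = QY_ref * (I / I_ref) * (A_ref / A) * (n^2 / n_ref^2),
    where I is the integrated emission intensity and A the absorbance
    at the excitation wavelength (ideally kept below ~0.1 to limit
    inner-filter effects)."""
    return (qy_ref * (i_sample / i_ref) * (a_ref / a_sample)
            * (n_sample ** 2 / n_ref ** 2))

# Illustrative inputs only; fluorescein in 0.1 M NaOH (QY = 0.95) as reference
qy = relative_plqy(i_sample=4.2e5, a_sample=0.08, n_sample=1.496,  # toluene
                   i_ref=8.0e5, a_ref=0.07, n_ref=1.333)           # water
print(f"relative PLQY = {qy:.1%}")
```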

Fig. 3. Emission spectra of CsPbBr$_3$ (no MOF, red) and CsPbBr$_3$/HZIF-8 (black), inset: corresponding photographs of CsPbBr$_3$ (B) and CsPbBr$_3$/HZIF-8 (A) under a 365 nm UV lamp (a); absorption spectra of HZIF-8 (black), CsPbBr$_3$/HZIF-8 (green) and CsPbBr$_3$ (purple) (b); TRPL decay curves of CsPbBr$_3$ (red) and CsPbBr$_3$/HZIF-8 (black) (c); pore size distributions and N$_2$ adsorption–desorption isotherms (inset) of HZIF-8 and CsPbBr$_3$/HZIF-8 (d).
The time-resolved PL decay dynamics of bare CsPbBr$_3$ and the CPB/HZIF-8 composite were further studied; the MOF-embedded CsPbBr$_3$ shows slower decay kinetics than the bare CsPbBr$_3$. The decay curves were fitted using a tri-exponential model, and the decay parameters are summarized in Table 1. The average lifetimes for bare CsPbBr$_3$ and CPB/HZIF-8 were calculated to be 25.23 ns and 40.13 ns respectively. The increase in the lifetime of CsPbBr$_3$ is associated with surface defect passivation of the PeQDs by the MOF matrix. The formula used to calculate the average lifetime is:
$$
\tau_{avg}=\left(\sum A_{i}\tau_{i}^{2}\right)\Big/\sum A_{i}\tau_{i}\,,\quad i=1,2,3
$$
where $\tau_{i}$ denotes the lifetime decay components, $A_{i}$ the corresponding pre-exponential factors, and $\tau_{avg}$ the average lifetime [47].
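The amplitude-weighted average lifetime defined above is straightforward to evaluate from the fitted parameters. In the sketch below, the A_i and τ_i values are illustrative placeholders, not the fitted values in Table 1:

```python
def avg_lifetime(amps, taus):
    """Amplitude-weighted average lifetime:
    tau_avg = sum(A_i * tau_i^2) / sum(A_i * tau_i)."""
    numerator = sum(a * t * t for a, t in zip(amps, taus))
    denominator = sum(a * t for a, t in zip(amps, taus))
    return numerator / denominator

# Illustrative tri-exponential fit parameters (lifetimes in ns)
amps = [0.50, 0.35, 0.15]
taus = [3.1, 18.0, 62.0]
print(f"tau_avg = {avg_lifetime(amps, taus):.2f} ns")
```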
# 3.3. Stability test of HZIF-8/CsPbBr$_3$ composite:
In general, due to the highly sensitive nature of CsPbBr$_3$ PeNCs, their applications in the analytical field are limited. The HZIF-8 MOF matrix provides extra stability to the PeNCs, making them highly stable under ambient conditions. Fig. 4 shows the stability of the composite powder against long-term storage in open air at room temperature (~70% humidity). The resulting powder retains 95% of its luminescence intensity for about two months, whereas the PL intensity of CPB PeNCs is practically quenched. It is encouraging to find that our composite has great moisture resistance, ascribed to the protective MOF shell which effectively shields CsPbBr$_3$ from oxygen and moisture. Furthermore, the UV photostability of the CsPbBr$_3$/HZIF-8 composite was tested by exposure to 365 nm UV light in an ambient atmosphere, with the PL intensity checked at various exposure intervals. After 80 h of exposure, the composite maintains 80% of its initial PL intensity. The use of PeQDs is normally limited exclusively to non-polar solvents; however, when the PeQDs are confined inside the pores of the MOF, this restriction is eliminated. The aqueous stability of the prepared CsPbBr$_3$/HZIF-8 composite was investigated by recording its PL after soaking the composite powder in aqueous solution. The composite formed a well-dispersed solution in water while maintaining its intense green emission. We checked the emission spectra of the water-dispersed solution over a period of 15 days (Fig. 4c). The composite also maintains its intense green emission in other polar protic solvents.
Table 1. Summary of TRPL decay lifetime results.

# 3.4. CsPbBr$_3$/HZIF-8 composite in chemical sensing applications:
As discussed above, the greatly improved fluorescence properties of the CsPbBr$_3$/HZIF-8 composite make it suitable for optical sensing applications. Therefore, the potential use of the proposed material in fluorescence chemosensing was examined.
# 3.4.1. Detection of Cu$^{2+}$ ions:
When exposed to various metal ions, the PeQD-encapsulated HZIF-8 exhibits a PL quenching response. Here, we examined the feasibility of employing the CsPbBr$_3$/HZIF-8 composite for copper ion detection in aqueous solutions. Fig. 5a displays the PL response of CsPbBr$_3$/HZIF-8 with increasing concentration of Cu$^{2+}$. The intense green emission of CsPbBr$_3$/HZIF-8, centered at 510 nm, was quenched with the addition of Cu$^{2+}$ over the 30 to 1500 nM concentration range. Dilution had a very small impact on quenching, and there was no solvent interference in the quenching of the green signal; therefore the Cu$^{2+}$ ion in the probe solution is responsible for the observed quenching. The Stern–Volmer equation (2) was used to analyze the quenching behavior of CsPbBr$_3$/HZIF-8, where the quenching ratio F$_0$/F and the copper concentration are linearly correlated (Fig. 5b) in the range of 30–1500 nM.
$$
F_{0}/F=1+K_{SV}[C]
$$
where F$_0$ and F denote the PL intensities of the PeQDs before and after the addition of the analyte respectively, [C] is the concentration of Cu$^{2+}$ in solution (nM), and K$_{SV}$ represents the Stern–Volmer quenching constant. The calibration curve was successfully fitted to the equation Y = 0.0045X − 0.305, with a correlation coefficient (R²) of 0.9957. The quenching constant (K$_{SV}$) was found to be 4.5 × 10⁶ M⁻¹. The limit of detection (LOD) for Cu$^{2+}$ was determined to be 4.66 nM using the relation 3σ/S, where S is the slope of the linear calibration graph and σ represents the standard deviation [48]. This LOD for Cu$^{2+}$ is significantly below the World Health Organization (WHO) recommended value for Cu$^{2+}$ in drinking water [49]. Compared to previous literature on Cu$^{2+}$ detection, our sensing method offers competitive sensitivity for luminescence-based Cu$^{2+}$ detection in aqueous media (Table S2).
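The analysis above reduces to an ordinary least-squares fit of F$_0$/F against [Cu$^{2+}$], with the slope giving K$_{SV}$ and the detection limit following from LOD = 3σ/S. A minimal sketch with hypothetical calibration data (the calibration points and blank standard deviation below are assumed for illustration, not the measured values):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical Stern-Volmer calibration in the 30-1500 nM range
conc_nM = [30, 100, 300, 600, 1000, 1500]
f0_over_f = [1 + 0.0045 * c for c in conc_nM]  # ideal linear response

slope, intercept = linear_fit(conc_nM, f0_over_f)  # slope in (F0/F) per nM
ksv_per_M = slope * 1e9           # per-nM slope converted to M^-1
sigma_blank = 0.002               # assumed std. dev. of blank measurements
lod_nM = 3 * sigma_blank / slope  # LOD = 3 sigma / S
print(f"Ksv = {ksv_per_M:.1e} M^-1, LOD = {lod_nM:.2f} nM")
```

With an ideal slope of 0.0045 per nM, the fit reproduces the K$_{SV}$ order of magnitude reported above; the LOD then scales directly with the assumed blank noise.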
|
||||
|
||||
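The Stern–Volmer bookkeeping above can be sketched numerically. The calibration points below are idealized, noise-free values consistent with the reported slope (0.0045 per nM), and the blank standard deviation is an assumed value, not taken from the paper:

```python
import numpy as np

# Idealized calibration points over the reported 30-1500 nM range
# (hypothetical values consistent with the reported slope, not digitized data)
conc_nM = np.array([30.0, 100.0, 250.0, 500.0, 1000.0, 1500.0])
f0_over_f = 1.0 + 0.0045 * conc_nM

# Linear least-squares fit of F0/F against [Cu2+]
slope, intercept = np.polyfit(conc_nM, f0_over_f, 1)

# Ksv in M^-1: the slope is per nM, so scale by 1e9 nM/M
ksv = slope * 1e9  # ~4.5e6 M^-1, matching the reported constant

# LOD = 3*sigma/S, with an assumed blank noise level of the PL ratio
sigma_blank = 0.007  # assumption for illustration
lod_nM = 3 * sigma_blank / slope  # ~4.7 nM
```

With the reported slope, an assumed blank noise of 0.007 reproduces an LOD on the order of the quoted 4.66 nM.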
Response time and stability, important parameters indicating the performance of a sensing system, were recorded for $\mathrm{Cu^{2+}}$ detection. The quenching proceeded so quickly that the emission intensity was reduced in less than two minutes, and no effective change in intensity was detected after that, showing the system's remarkable stability. The PL intensity of $\mathrm{CsPbBr_{3}}$/HZIF-8 in the presence of a fixed quantity of $\mathrm{Cu^{2+}}$ during a 2 h incubation is shown in Fig. 5c. This result verifies the system's capacity to detect $\mathrm{Cu^{2+}}$ regardless of reaction time.

The PL quenching of $\mathrm{CsPbBr_{3}}$/HZIF-8 by solutions of different metal ions $(1\,\mu\mathrm{M})$, including $\mathrm{Zn^{2+}}$, $\mathrm{Bi^{3+}}$, $\mathrm{Sr^{2+}}$, $\mathrm{Fe^{2+}}$, $\mathrm{Mg^{2+}}$, $\mathrm{Ni^{2+}}$, $\mathrm{Al^{3+}}$, $\mathrm{Ca^{2+}}$, $\mathrm{K^{+}}$, $\mathrm{Na^{+}}$, $\mathrm{Cd^{2+}}$, $\mathrm{Mn^{2+}}$, $\mathrm{Pb^{2+}}$, $\mathrm{Pt^{2+}}$, $\mathrm{Ag^{+}}$, $\mathrm{Pd^{2+}}$, $\mathrm{Fe^{3+}}$, $\mathrm{Co^{2+}}$, $\mathrm{Sn^{4+}}$ and $\mathrm{Ti^{4+}}$, was investigated to assess the selectivity of $\mathrm{CsPbBr_{3}}$/HZIF-8 towards $\mathrm{Cu^{2+}}$ detection. Compared to the other metal ions, $\mathrm{Cu^{2+}}$ displayed the highest quenching of the PL signal, indicating the clear selectivity of the sensing probe towards $\mathrm{Cu^{2+}}$. Fig. 5d shows the interference of the different metal ions on the relative fluorescence intensity of the composite with and without $\mathrm{Cu^{2+}}$. No noticeable change in the PL quenching by $\mathrm{Cu^{2+}}$ was observed in the presence of the other metal ions, confirming the practical applicability of this method for the determination of $\mathrm{Cu^{2+}}$. Variations in parameters such as steric interactions, redox potential differences, and metal–surface interactions are expected to underlie this selectivity [3], while the precise causes of these differences are beyond the scope of this investigation. We can conclude that, of the studied metal ions, the designed $\mathrm{CsPbBr_{3}}$/HZIF-8 composite shows a high selectivity for $\mathrm{Cu^{2+}}$ ion detection. This approach was further employed to detect Cu traces in tap water samples. The suggested approach has high accuracy and reliability for identifying low concentrations of Cu in natural samples (Table 2).


|
||||
Fig. 4. Storage test of $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ (a, b) Intensity vs. time plot of bare $\mathrm{Cs}\mathrm{Pb}{\tt B r}_{3}$ (red) and $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ (black), inset: Photographs of $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ powder in day 1 and day 60 under $365~\mathrm{nm}$ UV light, (c) Emission spectra evolution of aqueous $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ dispersion for 15 days and (d) UV Photo-stability test of the composite.
|
||||
|
||||
# 3.4.2. Analytical performance for melamine sensing:
|
||||
|
||||
When melamine is added to the $\mathrm{CsPbBr_{3}}$/HZIF-8–$\mathrm{Cu^{2+}}$ system under optimum experimental conditions, it releases the adsorbed copper ion from the surface of $\mathrm{CsPbBr_{3}}$/HZIF-8 and eventually restores the quenched signal. The addition of melamine alone to $\mathrm{CsPbBr_{3}}$/HZIF-8 has a negligible impact on the emission spectrum of the composite (Fig. 6a). To achieve the optimal and sensitive "off–on" response of the $\mathrm{CsPbBr_{3}}$/HZIF-8 probe for melamine detection, the influence of reaction time and pH was studied (Fig. S3b and c). The recovery of the PL signal begins within a minute, rises from 0 to 5 min, and reaches a maximum at 10 min. The system becomes steady after 10 min and was tested for up to 30 min. Similarly, with increasing pH from 4.5 to 6.5 the recovery increases, and it decreases at higher pH. An incubation time of 10 min and a solution pH of 6.5 were used for further sensing experiments. The FL response of the $\mathrm{CsPbBr_{3}}$/HZIF-8–$\mathrm{Cu^{2+}}$ system with increasing concentration of melamine from 0 to 500 nM is presented in Fig. 6b. Fig. 6c shows the correlation between the concentration of melamine and the PL recovery efficiency $(\mathrm{F}-\mathrm{F}_{0})/\mathrm{F}_{0}$, where $\mathrm{F}_{0}$ and $\mathrm{F}$ represent the PL intensity of the composite in the absence and presence of melamine, respectively. The calibration curve can be expressed as the linearly fitted equation $(\mathrm{F}-\mathrm{F}_{0})/\mathrm{F}_{0} = 0.0051C + 0.0866$ $(R^{2} = 0.9906)$. The LOD $(3\sigma/S)$ for melamine detection was estimated to be 2.64 nM, which is comparable to and even better than those described in the literature (Table S3).

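The melamine calibration can be inverted to estimate an unknown concentration from a PL reading. A minimal sketch, where the function name and the example intensities are illustrative and not from the paper:

```python
# Inverting the reported calibration (F - F0)/F0 = 0.0051*C + 0.0866 (C in nM)
def melamine_conc_nM(f, f0, slope=0.0051, intercept=0.0866):
    """Estimate melamine concentration (nM) from recovered (f) and quenched (f0) PL."""
    recovery = (f - f0) / f0
    return (recovery - intercept) / slope

# e.g. a signal that recovered from 100 (quenched) to 180 after melamine addition
c = melamine_conc_nM(180.0, 100.0)  # ~140 nM
```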
# 3.5. A plausible mechanism of sensing:

# 3.5.1. Quenching mechanism:

To explain the quenching by the $\mathrm{Cu^{2+}}$ ion, we first studied the UV–Vis absorption spectra of the fluorophore and the analyte. Fig. 7a shows that the absorption spectrum of $\mathrm{Cu^{2+}}$ has no significant overlap with the excitation and emission spectra of $\mathrm{CsPbBr_{3}}$/HZIF-8, which eliminates FRET (Förster resonant energy transfer) and the IFE (inner filter effect) as quenching mechanisms. Additionally, no shift or alteration of the absorption and emission spectra was seen following the introduction of $\mathrm{Cu^{2+}}$ into the probe solution (Fig. S4a). This indicates that the crystal structure or conformation of the MOF-protected QDs has not changed, excluding a static quenching mechanism that leads to the formation of a ground-state complex. XPS analysis was done before and after the interaction with $\mathrm{Cu^{2+}}$ to establish the adsorption of the Cu ion on the surface of the $\mathrm{CsPbBr_{3}}$/HZIF-8 composite (Fig. 8a). The high-resolution XPS spectrum of Cu 2p showed a strong signal at 932.5 eV on the $\mathrm{CsPbBr_{3}}$/HZIF-8 MOF surface, assigned to the Cu $2\mathrm{p}_{3/2}$ orbital (Fig. S5).


|
||||
Fig. 5. $\mathrm{{Cu}}^{2+}$ sensing- PL spectra of $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ with various concentration of $\mathrm{{Cu}}^{2+}$ (a), Calibration graph versus concentration of $\mathrm{{Cu}}^{2+}$ (b), graph showing the change in PL intensity of the sensor throughout the incubation period of $120\;\mathrm{min}$ (c), relative PL intensity of the composite with various other metal ion before (black) and after the addition of $\mathrm{{Cu}}^{2+}$ (cyan) (d).
|
||||
|
||||
Table 2 $\mathrm{Cu^{2+}}$ detection in real aqueous solution.



TRPL decay dynamics were explored to analyze the sensor's quenching kinetics. The fluorescence lifetime of the $\mathrm{CsPbBr_{3}}$/HZIF-8 composite decreases as the $\mathrm{Cu^{2+}}$ concentration increases. As shown in Fig. 7b, when 160 nM and 100 nM $\mathrm{Cu^{2+}}$ are present, the average lifetime drastically drops from 40.13 ns to 19.29 ns and 21.61 ns, respectively (Table 1). The faster decay dynamics of the fluorophore in the presence of the target analyte clearly illustrate nonradiative recombination pathways and indicate a dynamic quenching process. The redox potential of $\mathrm{Cu^{2+}/Cu^{+}}$ lies between the VB and CB of the $\mathrm{CsPbBr_{3}}$/HZIF-8 composite (VB and CB = 1.03 eV and 1.39 eV vs NHE), so electron transfer to $\mathrm{Cu^{2+}}$ might occur (Scheme 2) [50]. Cyclic voltammetry (CV) analysis of the $\mathrm{CsPbBr_{3}}$/HZIF-8 composite was performed to determine the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energy levels (details in the supporting file). The HZIF-8 MOF provides binding sites for the $\mathrm{Cu^{2+}}$ ion, which then diffuses rapidly to the $\mathrm{CsPbBr_{3}}$ perovskite; in perovskite lattices, ion migration occurs very easily because of their ionic nature. A coordination complex (–Br–Cu–N–MOF) can be formed through the interaction between the $\mathrm{Br^{-}}$ ion of $\mathrm{CsPbBr_{3}}$ and the Cu metal ion. In the FTIR spectra of the composite after $\mathrm{Cu^{2+}}$ addition (Fig. S6b), we observed a distinct shift of the N–H stretching vibration, indicating the formation of the coordination complex. Also, the XPS spectrum of Br was positively shifted after interaction with the $\mathrm{Cu^{2+}}$ ion (Fig. 9a), indicating coordination between the bromine of the perovskite and the Cu metal ion. Thus, the interaction with $\mathrm{Cu^{2+}}$ might produce new surface states or defect levels in the perovskite nanocrystals, facilitating nonradiative pathways for electron/hole recombination and finally quenching the FL signal of the $\mathrm{CsPbBr_{3}}$ perovskite composite. After the formation of an exciton, the hole created in the HOMO of the perovskite is favoured to be filled by these new states [3,27,50–52].

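The average lifetimes quoted from the TRPL fits are typically intensity-weighted averages of a multi-exponential decay, $\tau_{avg} = \sum a_i \tau_i^2 / \sum a_i \tau_i$. A minimal sketch of that bookkeeping; the amplitudes and component lifetimes below are hypothetical, since the paper does not list the fit components:

```python
def avg_lifetime(amplitudes, taus):
    """Intensity-weighted average lifetime from multi-exponential TRPL fit components."""
    num = sum(a * t * t for a, t in zip(amplitudes, taus))
    den = sum(a * t for a, t in zip(amplitudes, taus))
    return num / den

# Hypothetical biexponential fit: 60 % of a 10 ns component, 40 % of a 45 ns component
tau_avg_ns = avg_lifetime([0.6, 0.4], [10.0, 45.0])  # 36.25 ns
```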

# 3.5.2. Fluorescence recovery by melamine:

As mentioned in the above section, the PL quenching process can be explained by both electron transfer and a defect-level-mediated dynamic process arising from the coordinative interaction with the $\mathrm{Cu^{2+}}$ ion. When melamine is added, there is a competitive binding interaction between the perovskite and the melamine. Owing to the strong binding interaction of melamine with Cu, it removes the Cu ion from the surface of the composite, thereby eliminating the nonradiative electron/hole recombination pathways and finally recovering the PL. The TRPL decay dynamics of $\mathrm{CsPbBr_{3}}$/HZIF-8–Cu in the presence of melamine are presented in Fig. 8b. Melamine itself caused no significant change in the FL lifetime of $\mathrm{CsPbBr_{3}}$/HZIF-8. The reduced average lifetime of $\mathrm{CsPbBr_{3}}$/HZIF-8–Cu was restored from 19.28 ns to 37.32 ns after the introduction of 250 nM melamine (Table 1), signifying the removal of the non-radiative recombination pathways. The interaction of melamine with Cu is verified by the UV–Vis absorption spectra presented in Fig. 8c, where the presence of Cu significantly changes the absorption spectrum of melamine. The absorption peak of melamine at 202 nm shifts towards longer wavelengths and the peak at about 234 nm disappears with the addition of $\mathrm{Cu^{2+}}$. The FTIR spectra also show a minor peak change upon the addition of $\mathrm{Cu^{2+}}$ (Fig. S4b). All these findings prove the strong affinity of melamine towards the Cu metal ion, which is competitively captured by melamine. This is also supported by the XPS spectra, where the intensity of the Cu signal from the sensing probe decreased after melamine was added to the $\mathrm{CsPbBr_{3}}$/HZIF-8–Cu system (Fig. S5). Moreover, the shift noticed in the XPS spectrum of Br after Cu addition disappeared after melamine was added to the sensing probe, confirming the aforesaid findings (Fig. 9b).


|
||||
Fig. 6. $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ for melamine sensing- (a) inset: Photographs showing the recovery of green emission of $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ -Cu with melamine addition (left to right);(b) PL response of $\mathrm{(CsPbBr_{3}/H Z I F{-8}+C u)}$ with the addition of different concentration of melamine; (c) Calibration curve of fluorescence recovery vs. con centration of melamine; (d) PL recovery efficiency of the sensing probe system with other biological molecules.
|
||||
|
||||

|
||||
Fig. 7. (a) UV–vis absorption spectra of $\mathrm{{Cu}}^{2+}$ (Red), excitation (black) and emission spectra (blue) of $\mathrm{{Cs}\mathrm{{Pb}\mathrm{{Br}_{3}.}}}$ /HZIF-8, (b) TRPL decay dynamics of $\mathrm{CsPbBr}_{3}.$ /HZIF-8 in presence of different concentration of $\mathrm{{Cu}}^{2+}$ .
|
||||
|
||||

|
||||
Fig. 8. (a) XPS spectra of the $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ in presence of cu (red) and melamine (blue), (b) Fluorescence decay graph of CsPbBr3/HZIF-8-Cu with various concentration of melamine, (c) UV– vis absorption spectra of melamine (red), $\mathrm{{Cu}}^{2+}$ (purple), Cu-melamine (blue, green).
|
||||
|
||||

|
||||
Scheme 2. Band energy alignment of $\mathrm{CsPbBr_{3}/H Z I F{-8}}$ composite with Cu metal ion.
|
||||
|
||||

|
||||
Fig. 9. (a) High resolution XPS spectra of Br with and without the addition of $\mathrm{{Cu}}^{2+}$ metal ion, (b) Comparison with the XPS spectra of Br after melamine addition to the $\mathrm{CsPbBr_{3}/H Z I F{-}8{\mathrm{-}C u}}.$ .
|
||||
|
||||
Furthermore, to investigate the framework stability of the composite after the sensing experiment, P-XRD of the sample was analyzed (Fig. S7). No change in the characteristic diffraction peaks relative to the pristine MOF indicates that the framework is not damaged.

Selectivity: To determine the selectivity of the sensing system for the detection of melamine, the effect of possible interfering substances, including glycine (Gly), cysteine (Cys), glutamic acid (Glu), p-phenylenediamine (p-Phe), serine (Ser), alanine (Ala), fructose, glucose and vitamin C, was examined. Fig. 6d shows that the luminescence of the $\mathrm{CsPbBr_{3}}$/HZIF-8–Cu composite is not significantly affected by substances other than melamine, with its multifunctional heterocyclic system. The foregoing data imply that our sensor has adequate selectivity.

# 3.6. Practical application of PeQD embedded HZIF-8 for melamine detection in milk samples:

To assess the practicability of the sensor, we further employed this switchable fluorescent nanosensor to detect melamine in milk samples (liquid raw milk and infant formula). Because melamine was absent from these milk samples, standard spiked-recovery studies were used to assess the precision of our established probe. Melamine was spiked at various levels into each sample, and the fluorescence signal was then measured. The data presented in Table 3 show melamine recoveries in the spiked samples at three different concentrations from 94.7 % to 100.8 %, with RSD (relative standard deviation) not exceeding 6.37 %, signifying that the proposed fluorescent sensing platform is a reliable method for the detection of melamine in dairy products and has good applicability.

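The spike-recovery bookkeeping behind such tables is simple: percent recovery of each replicate and the relative standard deviation across replicates. A minimal sketch; the spike level and replicate readings below are hypothetical, not Table 3 values:

```python
import statistics

def recovery_and_rsd(spiked_nM, measured_nM):
    """Mean percent recovery and RSD (%) across replicate measurements of one spike."""
    recoveries = [100.0 * m / spiked_nM for m in measured_nM]
    mean = statistics.mean(recoveries)
    rsd = 100.0 * statistics.stdev(recoveries) / mean
    return mean, rsd

# Hypothetical triplicate readings (nM) for a 100 nM spike
mean_rec, rsd = recovery_and_rsd(100.0, [96.2, 99.1, 101.5])
```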
Table 3 Summary of melamine detection in real samples.



# 4. Conclusion

In this work, we designed a stable and effective sensing platform by loading $\mathrm{CsPbBr_{3}}$ PeQDs into the porous HZIF-8 using an easy two-step in-situ growth method. The hierarchically porous HZIF-8 MOF allows better PeQD diffusion than microporous MOF hosts, enabling uniform PeQD distribution, and provides better stability than bare PeQDs. All the relevant properties of the $\mathrm{CsPbBr_{3}}$/HZIF-8 composite were thoroughly analyzed. Good fluorescence intensity and great stability were maintained after the transfer of the produced $\mathrm{CsPbBr_{3}}$/HZIF-8 composites to the aqueous phase. Further, this nanosensor was utilized for the on–off–on detection of $\mathrm{Cu^{2+}}$ and melamine. The $\mathrm{Cu^{2+}}$ analyte acts as an effective quencher of the green emission of $\mathrm{CsPbBr_{3}}$ through a dynamic quenching and electron transfer process. The quenched emission of $\mathrm{CsPbBr_{3}}$/HZIF-8 can be restored by the competitive removal of $\mathrm{Cu^{2+}}$ from the surface of the sensor by the functional amine groups of melamine. This sensor is very sensitive towards the detection of $\mathrm{Cu^{2+}}$ and melamine, and good linear relationships were found. Because copper has a strong attraction to melamine, this FL assay is sensitive and selective and was used to find melamine in real samples with satisfactory recoveries. Notably, the approach of embedding $\mathrm{CsPbBr_{3}}$ nanoparticles in a hierarchically porous matrix to generate custom-tailored properties should enhance the design of stable PeQDs and broaden their application window in the sensing field.

# CRediT authorship contribution statement

Shahnaz Ahmed: Writing – original draft, Validation, Methodology, Formal analysis, Conceptualization. Suman Lahkar: Visualization. Simanta Doley: Visualization. Dambarudhar Mohanta: Investigation, Resources. Swapan Kumar Dolui: Supervision, Investigation, Writing – review & editing.
# Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


# Data availability

Data will be made available on request.

# Acknowledgements

The authors would like to acknowledge SAIF, CSIR-NEIST, Jorhat, Assam, India for XPS analysis; the Material Analysis & Research Centre, Bengaluru for BET analysis; UGC-SAP-DRS-II, Tezpur University for TRPL analysis; and the Sophisticated Analytical Instrumentation Centre (SAIC), Tezpur University, India for providing the remaining analytical support. The authors are grateful to the DST-INSPIRE Fellowship, DST, Govt. of India for funding.

# Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jphotochem.2023.114821.

# References

[1] M. Jaishankar, T. Tseten, N. Anbalagan, B.B. Mathew, K.N. Beeregowda, Interdiscip. Toxicol. 7 (2014) 60, https://doi.org/10.2478/intox-2014-0009.
[2] X. Guo, J. Huang, M. Wang, L. Wang, Sens. Actuators B 309 (2020) 127766, https://doi.org/10.1016/j.snb.2020.127766.
[3] H. Li, W. Yin, C.K. Ng, R. Huang, S. Du, M. Sharma, J.J. Jasieniak, Nanoscale 14 (2022) 11953, https://doi.org/10.1039/D2NR02737B.
[4] L. Li, G. Wu, T. Hong, Z. Yin, D. Sun, E.S. Abdel-Halim, J.J. Zhu, ACS Appl. Mater. Interfaces 6 (2014) 2858–2864, https://doi.org/10.1021/am405305r.
[5] L. Zhu, G. Gamez, H. Chen, K. Chingin, R. Zenobi, Chem. Commun. 5 (2009) 559–561, https://doi.org/10.1039/B818541G.
[6] Q. Lu, J. Zhao, S. Xue, P. Yin, Y. Zhang, S. Yao, Analyst 140 (2015) 1155–1160, https://doi.org/10.1039/C4AN01847H.
[7] G. Venkatasami, J.R. Sowa Jr, Anal. Chim. Acta 665 (2010) 227–230, https://doi.org/10.1016/j.aca.2010.03.037.
[8] S. Goscinny, V. Hanot, J.F. Halbardier, J.Y. Michelet, J. Van Loco, Food Control 22 (2011) 226–230, https://doi.org/10.1016/j.foodcont.2010.04.032.
[9] C.W. Liao, Y.R. Chen, J.L. Chang, J.M. Zen, J. Agric. Food Chem. 59 (2011) 9782–9787, https://doi.org/10.1021/jf201989f.
[10] W. Chen, H.H. Deng, L. Hong, Z.Q. Wu, S. Wang, A.L. Liu, X.H. Xia, Analyst 137 (2012) 5382–5386, https://doi.org/10.1039/C2AN35962F.
[11] J. Liu, Y. Zhong, J. Liu, H. Zhang, J. Xi, J. Wang, Food Control 21 (2010) 1482–1487, https://doi.org/10.1016/j.foodcont.2010.04.018.
[12] L.M. Chen, Y.N. Liu, ACS Appl. Mater. Interfaces 3 (2011) 3091–3096, https://doi.org/10.1021/am200603y.
[13] J. Guan, Y.Z. Shen, Y. Shu, D. Jin, Q. Xu, X.Y. Hu, Adv. Mater. Interfaces 8 (2021) 2100588, https://doi.org/10.1021/acs.analchem.1c04348.
[14] L. Yang, Y.L. Liu, X.X. Ji, C.G. Liu, Y. Fu, F. Ye, J. Taiwan Inst. Chem. Eng. 126 (2021) 173–181, https://doi.org/10.1016/j.jtice.2021.07.028.
[15] Y. Liu, L. Yang, Q. Bai, W. Li, Y. Zhang, Y. Fu, F. Ye, Chem. Eng. J. 420 (2021) 129939, https://doi.org/10.1016/j.cej.2021.129939.
[16] Y. Liu, L. Li, M. Yue, L. Yang, F. Sun, G. Xu, F. Ye, Chem. Eng. J. 430 (2022) 132758, https://doi.org/10.1016/j.cej.2021.132758.
[17] L. Li, S. Gao, L. Yang, Y.L. Liu, P. Li, F. Ye, Y. Fu, J. Hazard. Mater. 404 (2021) 124015, https://doi.org/10.1016/j.jhazmat.2020.124015.
[18] C.H. Lei, X.E. Zhao, S.L. Jiao, L. He, Y. Li, S.Y. Zhu, J.M. You, Anal. Methods 8 (2016) 4438–4444, https://doi.org/10.1039/C6AY01063F.
[19] V.V. Halali, C.G. Sanjayan, V. Suvina, M. Sakar, R.G. Balakrishna, Inorg. Chem. Front. 7 (2020) 2702–2725, https://doi.org/10.1039/D0QI00306A.
[20] D.C. Wang, Y. Lei, W. Jiao, Y.F. Liu, C.H. Mu, X. Jian, Rare Met. 40 (2021) 3–19, https://doi.org/10.1007/s12598-020-01622-y.
[21] F. Sun, L. Yang, S. Li, Y. Wang, L. Wang, P. Li, Y. Fu, J. Agric. Food Chem. 69 (2021) 12661–12673, https://doi.org/10.1021/acs.jafc.1c05246.
[22] F. Zhang, H. Zhong, C. Chen, X.G. Wu, X. Hu, H. Huang, Y. Dong, ACS Nano 9 (2015) 4533–4542, https://doi.org/10.1021/acsnano.5b01154.
[23] Y. Wang, M.I. Dar, L.K. Ono, T. Zhang, M. Kan, Y. Li, Y. Zhao, Science 365 (2019) 591–595, https://doi.org/10.1126/science.aav8680.
[24] L.Q. Lu, T. Tan, X.K. Tian, Y. Li, P. Deng, Anal. Chim. Acta 986 (2017) 109–114, https://doi.org/10.1016/j.aca.2017.07.014.
[25] X. Chen, C. Sun, Y. Liu, L. Yu, K. Zhang, A.M. Asiri, S. Wang, Chem. Eng. J. 379 (2020) 122360, https://doi.org/10.1016/j.cej.2019.122360.
[26] X. Xiang, H. Ouyang, J. Li, Z. Fu, Sens. Actuators B 346 (2021) 130547, https://doi.org/10.1016/j.snb.2021.130547.
[27] X. Sheng, Y. Liu, Y. Wang, Y. Li, X. Wang, X. Wang, X. Xu, Adv. Mater. 29 (2017) 1700150, https://doi.org/10.1002/adma.201700150.
[28] C. Huangfu, L. Feng, Sens. Actuators B 344 (2021) 130193, https://doi.org/10.1016/j.snb.2021.130193.
[29] H. Yang, W. Yin, W. Dong, L. Gao, C.H. Tan, W. Li, J. Zhang, J. Mater. Chem. C 8 (2020) 14439–14445, https://doi.org/10.1039/D0TC03510F.
[30] J.Y. Sun, F.T. Rabouw, X.F. Yang, X.Y. Huang, X.P. Jing, S. Ye, Q.Y. Zhang, Adv. Funct. Mater. 27 (2017) 1704371.
[31] Q. Zhong, M. Cao, H. Hu, D. Yang, M. Chen, P. Li, Q. Zhang, ACS Nano 12 (2018) 8579–8587, https://doi.org/10.1021/acsnano.8b04209.
[32] Z.J. Li, E. Hofman, J. Li, A.H. Davis, C.H. Tung, L.Z. Wu, W. Zheng, Adv. Funct. Mater. 28 (2018) 1704288, https://doi.org/10.1002/adfm.201704288.
[33] T. Xia, Y. Lin, W. Li, M. Ju, Chin. Chem. Lett. 32 (2021) 2975–2984, https://doi.org/10.1016/j.cclet.2021.02.058.
[34] J. Cuan, D. Zhang, W. Xing, J. Han, H. Zhou, Y. Zhou, Chem. Eng. J. 425 (2021) 131556, https://doi.org/10.1016/j.cej.2021.131556.
[35] J. Ren, T. Li, X. Zhou, X. Dong, A.V. Shorokhov, M.B. Semenov, Y. Wang, Chem. Eng. J. 358 (2019) 30–39, https://doi.org/10.1016/j.cej.2018.09.149.
[36] S. Mollick, T.N. Mandal, A. Jana, S. Fajal, A.V. Desai, S.K. Ghosh, ACS Appl. Nano Mater. 2 (2019) 1333–1340, https://doi.org/10.1021/acsanm.8b02214.
[37] J.H. Cha, K. Noh, W. Yin, Y. Lee, Y. Park, T.K. Ahn, O. Terasaki, J. Phys. Chem. Lett. 10 (2019) 2270–2277, https://doi.org/10.1021/acs.jpclett.9b00510.
[38] Y. Cao, Y. Zhou, Y. Lin, J.J. Zhu, Anal. Chem. 93 (2020) 1818–1825, https://doi.org/10.1021/acs.analchem.0c04717.
[39] L. Yang, Y.L. Liu, C.G. Liu, F. Ye, Y. Fu, J. Hazard. Mater. 381 (2020) 120966, https://doi.org/10.1016/j.jhazmat.2019.120966.
[40] X.L. Yang, C. Ding, R.F. Guan, W.H. Zhang, Y. Feng, M.H. Xie, J. Hazard. Mater. 403 (2021) 123698, https://doi.org/10.1016/j.jhazmat.2020.123698.
[41] J.M. Yang, X.W. Hu, Y.X. Liu, W. Zhang, Microporous Mesoporous Mater. 274 (2019) 149–154, https://doi.org/10.1016/j.micromeso.2018.07.042.
[42] L. Guo, Y. Liu, R. Kong, G. Chen, H. Wang, X. Wang, F. Qu, Sens. Actuators B 295 (2019) 1–6, https://doi.org/10.1016/j.snb.2019.05.064.
[43] R. Jalili, A. Khataee, M.R. Rashidi, R. Luque, Sens. Actuators B 297 (2019) 126775, https://doi.org/10.1016/j.snb.2019.126775.
[44] H.N. Abdelhamid, Z. Huang, A.M. El-Zohry, H. Zheng, X. Zou, Inorg. Chem. 56 (2017) 9139–9146, https://doi.org/10.1021/acs.inorgchem.7b01191.
[45] Z. Zhang, X. Luo, B. Wang, J. Zhang, ACS Appl. Energy Mater. 2 (2019) 2760–2768, https://doi.org/10.1021/acsaem.9b00098.
[46] Z.C. Kong, J.F. Liao, Y.J. Dong, Y.F. Xu, H.Y. Chen, D.B. Kuang, C.Y. Su, ACS Energy Lett. 3 (2018) 2656–2662, https://doi.org/10.1021/acsenergylett.8b01658.
[47] M. Worku, Y. Tian, C. Zhou, H. Lin, M. Chaaban, L.J. Xu, B. Ma, Sci. Adv. 6 (2020) eaaz5961, https://doi.org/10.1126/sciadv.aaz5961.
[48] N. Ding, D. Zhou, G. Pan, W. Xu, X. Chen, D. Li, H. Song, ACS Sustain. Chem. Eng. 7 (2019) 8397–8404, https://doi.org/10.1021/acssuschemeng.9b00038.
[49] X. Jin, H. Chen, W. Zhang, B. Wang, W. Shen, H. Lu, Appl. Organomet. Chem. 32 (2018) e4577.
[50] J. Tian, Q. Liu, A.M. Asiri, A.O. Al-Youbi, X. Sun, Anal. Chem. 85 (2013) 5595–5599, https://doi.org/10.1021/ac400924j.
[51] S. Huang, M. Guo, J. Tan, Y. Geng, J. Wu, Y. Tang, Y. Liang, ACS Appl. Mater. Interfaces 10 (2018) 39056–39063, https://doi.org/10.1021/acsami.8b14472.
[52] L.H. Jin, C.S. Han, Anal. Chem. 86 (2014) 7209–7213, https://doi.org/10.1021/ac501515f.

0
clean/__init__.py
Normal file
273
clean/preprocess_mineru.py
Normal file
@@ -0,0 +1,273 @@
import re
import os
import json
import copy
import requests
import time
import sqlite3
import PyPDF2
import multiprocessing
import mysql.connector

from loguru import logger
from glob import glob
from tqdm import tqdm

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config

model_config.__use_inside_model__ = True

# Image-bed configuration
IMGBED_URL = "http://localhost:40027/"
# Make sure the image-bed URL ends with a slash
if not IMGBED_URL.endswith('/'):
    IMGBED_URL += '/'
token_endpoint = f"{IMGBED_URL}api/v1/tokens"
upload_endpoint = f"{IMGBED_URL}api/v1/upload"

# Obtain a token with:
# curl -X POST http://localhost:40027/api/v1/tokens -H "Content-Type: application/json" -d '{"email":"yt.li2@siat.ac.cn", "password":"lyt20000414."}'
IMGBED_TOKEN = "6|QsBh5H7txY3Hd7ju1nzYKOBSdFQeL0YberydSFIH"

def replace_image_links(md_content: str, images_urls: dict) -> str:
    # Match Markdown image links of the form ![alt](path)
    pattern = r'!\[(.*?)\]\((.*?)\)'

    def replace_link(match):
        # Path of the currently matched image
        image_path = match.group(2)
        # Check whether this path is in the mapping
        if image_path in images_urls:
            # Take the new URL from the mapping
            new_url = images_urls[image_path]
            return f"![{match.group(1)}]({new_url})"
        return match.group(0)

    # Perform the substitution
    updated_md_content = re.sub(pattern, replace_link, md_content)
    return updated_md_content

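A quick usage sketch of `replace_image_links` (a standalone copy of the function so the snippet runs on its own; the example paths and URL are made up):

```python
import re

def replace_image_links(md_content: str, images_urls: dict) -> str:
    # Rewrite ![alt](path) links whose path appears in the mapping
    pattern = r'!\[(.*?)\]\((.*?)\)'

    def replace_link(match):
        image_path = match.group(2)
        if image_path in images_urls:
            return f"![{match.group(1)}]({images_urls[image_path]})"
        return match.group(0)

    return re.sub(pattern, replace_link, md_content)

md = "See ![fig1](images/a.png) and ![fig2](images/b.png)."
urls = {"images/a.png": "http://example.com/a.png"}
out = replace_image_links(md, urls)
# links with an uploaded counterpart are rewritten, the rest are left untouched
```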
# Upload images to LSKY Pro
def upload_image(img_dir):
    headers = {
        "Authorization": f"Bearer {IMGBED_TOKEN}",
        'Accept': 'application/json'
    }

    image_urls = {}
    os.makedirs(img_dir, exist_ok=True)
    img_names = os.listdir(img_dir)
    for image_name in img_names:
        retry = 0
        image_path = os.path.join(img_dir, image_name)
        while retry < 5:  # maximum number of retries
            try:
                with open(image_path, 'rb') as image_file:  # keep the file open during upload
                    files = {'file': image_file}

                    # Upload the file
                    response = requests.post(upload_endpoint, headers=headers, files=files)
                    if response.status_code == 200:
                        result = response.json()
                        if result['status']:
                            image_url = result['data']['links']['url']
                            image_urls['images/' + image_name] = image_url
                            break  # upload succeeded, leave the retry loop
                        else:
                            raise Exception(f"Image upload failed: {result['message']}")
                    elif response.status_code == 429:
                        # 429 response: wait a while before retrying
                        wait_time = min(2 ** retry, 60)  # exponential backoff, capped at 60 s
                        logger.warning(f"Too many requests, waiting {wait_time} s...")
                        time.sleep(wait_time)
                    else:
                        raise Exception(f"HTTP request failed: {response.status_code}")

                retry += 1  # bump the retry counter
                time.sleep(1)  # brief pause after a failed attempt

            except FileNotFoundError:
                logger.error(f"File {image_path} does not exist, please check the path")
                return

    return image_urls

def json_md_dump(
    pipe,
    md_writer,
    pdf_name,
    content_list,
    md_content,
):
    # Write the model result to model.json
    orig_model_list = copy.deepcopy(pipe.model_list)
    md_writer.write(
        content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_model.json"
    )

    # Write the intermediate result to middle.json
    md_writer.write(
        content=json.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_middle.json"
    )

    # Write the text result to content_list.json
    md_writer.write(
        content=json.dumps(content_list, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_content_list.json"
    )

    # Write the result to the .md file
    md_writer.write(
        content=md_content,
        path=f"{pdf_name}.md"
    )

def pdf_parse_main(
    pdf_path: str,
    parse_method: str = 'auto',
    model_json_path: str = None,
    is_json_md_dump: bool = True,
    output_dir: str = None
):
    """
    Convert a PDF to json and md; the md and json files are written to the PDF's directory.

    :param pdf_path: path to the .pdf file, relative or absolute
    :param parse_method: parsing method, one of auto, ocr, txt; default auto; try ocr if results are poor
    :param model_json_path: existing model-data file; if empty, the built-in model is used; the pdf and model_json must correspond
    :param is_json_md_dump: whether to dump the parsed data to .json and .md files; default True; data from different stages goes to different .json files (3 in total), and the md content is saved to a .md file
    :param output_dir: output directory; a folder named after the pdf is created and all results are saved there
    """
    try:
        pdf_name = os.path.basename(pdf_path).split("/")[-1].replace(".pdf", "")
        pdf_path_parent = os.path.dirname(pdf_path)

        if output_dir:
            output_path = os.path.join(output_dir, pdf_name)
        else:
            output_path = os.path.join(pdf_path_parent, pdf_name)

        output_image_path = os.path.join(output_path, 'images')

        # Parent path of the images, so they are saved with relative paths in the .md and content_list.json files
        image_path_parent = os.path.basename(output_image_path)

        pdf_bytes = open(pdf_path, "rb").read()  # read the binary data of the pdf

        if model_json_path:
            # Read the raw json data (a list) of a pdf already parsed by the model
            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
        else:
            model_json = []

        # Run the parsing steps
        # image_writer = DiskReaderWriter(output_image_path)
        image_writer, md_writer = DiskReaderWriter(output_image_path), DiskReaderWriter(output_path)

        # Choose the parsing mode
        # jso_useful_key = {"_pdf_type": "", "model_list": model_json}
        # pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        if parse_method == "auto":
            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        elif parse_method == "txt":
            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
        elif parse_method == "ocr":
            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
        else:
            logger.error("unknown parse method, only auto, ocr, txt allowed")
            exit(1)

        # Classify
        pipe.pipe_classify()

        # If no model data was passed in, parse with the built-in model
        if not model_json:
            if model_config.__use_inside_model__:
                pipe.pipe_analyze()  # analyze
            else:
                logger.error("need model list input")
                exit(1)

        # Parse
        pipe.pipe_parse()

        # Save the results in text and md format
        content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
        md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
        # Upload the images to the image bed
        image_urls = upload_image(output_image_path)
        md_content = replace_image_links(md_content, image_urls)

        if is_json_md_dump:
            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
        return 'success'

    except Exception as e:
        logger.exception(e)
        return 'error'

def init_worker(devices, pdfs, gpu_index):
|
||||
"""
|
||||
Initialize a worker process to process a chunk of PDFs with a specific GPU.
|
||||
"""
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)
|
||||
process_pdf_chunk(pdfs, gpu_index)
|
||||
|
||||
def process_pdf_chunk(pdf_paths, worker_id):
|
||||
for pdf_path in tqdm(pdf_paths, desc=f"Worker {worker_id} Progress"):
|
||||
try:
|
||||
with open(pdf_path, 'rb') as file:
|
||||
pdf_reader = PyPDF2.PdfReader(file)
|
||||
print(os.path.basename(pdf_path).replace(".pdf", "").replace('_', '/'))
|
||||
status = pdf_parse_main(pdf_path, parse_method='auto', output_dir=output_dir)
|
||||
except PyPDF2.errors.PdfReadError:
|
||||
logger.error(f"{pdf_path} has been broken")
|
||||
except Exception as e:
|
||||
logger.error(f"{pdf_path} has an error: {e}")
|
||||
|
||||
def multiprocessing_setup(pdf_paths, num_gpus):
|
||||
num_processes_per_gpu = 2
|
||||
chunk_size = len(pdf_paths) // (num_gpus * num_processes_per_gpu)
|
||||
processes = []
|
||||
|
||||
# Create processes for each GPU
|
||||
for gpu_id in range(num_gpus):
|
||||
for process_id in range(num_processes_per_gpu):
|
||||
start_idx = (gpu_id * num_processes_per_gpu + process_id) * chunk_size
|
||||
end_idx = None if (gpu_id == num_gpus - 1 and process_id == num_processes_per_gpu - 1) else start_idx + chunk_size
|
||||
chunk = pdf_paths[start_idx:end_idx]
|
||||
|
||||
p = multiprocessing.Process(target=init_worker, args=([gpu_id], chunk, gpu_id))
|
||||
processes.append(p)
|
||||
p.start()
|
||||
|
||||
# Ensure all processes have completed
|
||||
for p in processes:
|
||||
p.join()
|
||||
|
||||
if __name__ == "__main__":
|
||||
_cur_dir = os.path.dirname(os.path.abspath(__file__))
|
||||
# 此处更改路径
|
||||
pdf_dir = os.path.join(_cur_dir, "black_phosphorus_wulie/黑磷文献/黑磷文献-任务1-推荐官能团")
|
||||
output_dir = os.path.join(_cur_dir, "black_phosphorus_wulie/黑磷文献-任务1-推荐官能团_pdf2md")
|
||||
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
pdf_paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))
|
||||
|
||||
print("pdf数量:", len(pdf_paths))
|
||||
|
||||
# Number of GPUs
|
||||
num_gpus = 8
|
||||
|
||||
# Setup multiprocessing to handle PDFs across multiple GPUs
|
||||
# multiprocessing_setup(pdf_paths, num_gpus)
|
||||
|
||||
pdf_path = "/home/ubuntu/sas0/LYT/paper_dataset/black_phosphorus_wulie/黑磷文献/黑磷文献-任务1-推荐官能团/(P-O,P-O-P)Supporting_information.pdf"
|
||||
pdf_parse_main(pdf_path, parse_method='auto', output_dir=output_dir)
|
||||
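The index arithmetic in `multiprocessing_setup` above is easy to get wrong, so here is a minimal sketch of the same chunking scheme with toy values (10 hypothetical paths, 2 GPUs, 2 processes per GPU; the names are illustrative, not part of the script):

```python
# Reproduces the chunking scheme of multiprocessing_setup above with toy values.
pdf_paths = [f"paper_{i}.pdf" for i in range(10)]
num_gpus = 2
num_processes_per_gpu = 2
chunk_size = len(pdf_paths) // (num_gpus * num_processes_per_gpu)  # 10 // 4 == 2

chunks = []
for gpu_id in range(num_gpus):
    for process_id in range(num_processes_per_gpu):
        start_idx = (gpu_id * num_processes_per_gpu + process_id) * chunk_size
        # the very last worker slices to None, so it absorbs the remainder
        end_idx = None if (gpu_id == num_gpus - 1 and process_id == num_processes_per_gpu - 1) else start_idx + chunk_size
        chunks.append(pdf_paths[start_idx:end_idx])
```

Every path lands in exactly one chunk; the cost of this scheme is that the last worker can end up with noticeably more files than the others.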
245  clean/preprocess_mineru_new.py  Normal file
@@ -0,0 +1,245 @@
import re
import os
import requests
import time
import PyPDF2
import multiprocessing as mp
import math
import sys
import torch

from loguru import logger
from glob import glob
from tqdm import tqdm

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

# image host configuration
IMGBED_URL = "http://localhost:40027/"
# make sure the imgbed url ends with /
if not IMGBED_URL.endswith('/'):
    IMGBED_URL += '/'
token_endpoint = f"{IMGBED_URL}api/v1/tokens"
upload_endpoint = f"{IMGBED_URL}api/v1/upload"

# obtain a token as follows:
# curl -X POST http://localhost:40027/api/v1/tokens -H "Content-Type: application/json" -d '{"email":"yt.li2@siat.ac.cn", "password":"lyt20000414."}'
IMGBED_TOKEN = "6|QsBh5H7txY3Hd7ju1nzYKOBSdFQeL0YberydSFIH"


def replace_image_links(md_content: str, images_urls: dict) -> str:
    # match Markdown image links of the form ![alt](path)
    pattern = r'!\[(.*?)\]\((.*?)\)'

    def replace_link(match):
        # extract the image path from the current match
        image_path = match.group(2)
        # check whether the path is in the mapping
        if image_path in images_urls:
            # take the new URL from the mapping
            new_url = images_urls[image_path]
            return f"![{match.group(1)}]({new_url})"
        return match.group(0)

    # perform the substitution
    updated_md_content = re.sub(pattern, replace_link, md_content)
    return updated_md_content
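A quick check of the link-rewriting regex above, assuming the replacement emits a standard `![alt](url)` link (the image name and host URL below are made up for illustration):

```python
import re

pattern = r'!\[(.*?)\]\((.*?)\)'
images_urls = {"images/fig1.png": "http://imgbed.example/fig1.png"}  # hypothetical mapping

def replace_link(match):
    image_path = match.group(2)
    if image_path in images_urls:
        # rewrite to the hosted URL, keeping the original alt text
        return f"![{match.group(1)}]({images_urls[image_path]})"
    return match.group(0)  # paths not in the mapping are left untouched

md = "See ![Figure 1](images/fig1.png) and ![other](images/missing.png)."
updated = re.sub(pattern, replace_link, md)
```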

# upload images to LSKY Pro
def upload_image(img_dir):
    headers = {
        "Authorization": f"Bearer {IMGBED_TOKEN}",
        'Accept': 'application/json'
    }

    image_urls = {}
    os.makedirs(img_dir, exist_ok=True)
    img_names = os.listdir(img_dir)
    for image_name in img_names:
        retry = 0
        image_path = os.path.join(img_dir, image_name)
        while retry < 5:  # maximum number of retries
            try:
                with open(image_path, 'rb') as image_file:  # keep the file open while uploading
                    files = {'file': image_file}

                    # upload the file
                    response = requests.post(upload_endpoint, headers=headers, files=files)
                    if response.status_code == 200:
                        result = response.json()
                        if result['status']:
                            image_url = result['data']['links']['url']
                            image_urls['images/' + image_name] = image_url
                            break  # upload succeeded, leave the retry loop
                        else:
                            raise Exception(f"image upload failed: {result['message']}")
                    elif response.status_code == 429:
                        # 429 response: wait a while and retry
                        wait_time = min(2 ** retry, 60)  # exponential backoff, capped at 60 seconds
                        logger.warning(f"too many requests, waiting {wait_time} seconds...")
                        time.sleep(wait_time)
                    else:
                        raise Exception(f"HTTP request error: {response.status_code}")

                retry += 1  # bump the retry counter
                time.sleep(1)  # brief pause after a failed attempt

            except FileNotFoundError:
                logger.error(f"file {image_path} does not exist, please check the path")
                return

    return image_urls
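The wait used for 429 responses above grows as `min(2 ** retry, 60)`; a small sketch of the resulting schedule:

```python
def backoff_seconds(retry: int, cap: int = 60) -> int:
    """Exponential backoff with an upper cap, as used for 429 responses above."""
    return min(2 ** retry, cap)

# the wait doubles each retry, then saturates at the cap
wait_times = [backoff_seconds(r) for r in range(8)]
```

Capping the wait keeps a long run of 429s from stalling a worker for minutes per image.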

def pdf_parse_main(
        pdf_path: str,
        output_dir: str = None
):
    try:
        name_without_suff = os.path.basename(pdf_path).replace('.pdf', '')

        # prepare env
        local_md_dir = os.path.join(output_dir, name_without_suff)
        local_image_dir = os.path.join(local_md_dir, 'images')
        image_dir = str(os.path.basename(local_image_dir))

        os.makedirs(local_image_dir, exist_ok=True)

        image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
            local_md_dir
        )

        # read bytes
        reader1 = FileBasedDataReader("")
        pdf_bytes = reader1.read(pdf_path)  # read the pdf content
        # proc
        ## Create Dataset Instance
        ds = PymuDocDataset(pdf_bytes)
        ## inference
        if ds.classify() == SupportedPdfParseMethod.OCR:
            infer_result = ds.apply(doc_analyze, ocr=True)
            ## pipeline
            pipe_result = infer_result.pipe_ocr_mode(image_writer)
        else:
            infer_result = ds.apply(doc_analyze, ocr=False)
            ## pipeline
            pipe_result = infer_result.pipe_txt_mode(image_writer)
        ### draw model result on each page
        infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
        ### draw layout result on each page
        pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
        ### draw spans result on each page
        pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
        ### dump markdown
        md_content = pipe_result.dump_md(md_writer, os.path.join(local_md_dir, f"{name_without_suff}.md"), image_dir)
        ### dump content list
        pipe_result.dump_content_list(md_writer, os.path.join(local_md_dir, f"{name_without_suff}_content_list.json"), image_dir)

        # upload the images to the image host
        image_urls = upload_image(local_image_dir)
        md_content = replace_image_links(md_content, image_urls)

        md_writer.write_string(os.path.join(local_md_dir, f"{name_without_suff}.md"), md_content)

    except Exception as e:
        logger.exception(e)
        return 'error'


def init_worker(pdfs, gpu_index, output_dir):  # output_dir parameter added
    """
    Initialize a worker process to process a chunk of PDFs with a specific GPU.
    """
    try:
        # pin the CUDA device
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)

        import torch
        device = torch.device('cuda:0')

        print(f"process {os.getpid()} started on GPU {gpu_index}")
        print(f"processing {len(pdfs)} PDF files")

        process_pdf_chunk(pdfs, device, output_dir)  # pass output_dir through

    except Exception as e:
        print(f"process {os.getpid()} failed to initialize on GPU {gpu_index}: {str(e)}")
        raise e


def process_pdf_chunk(pdf_paths, worker_id, output_dir):
    for pdf_path in tqdm(pdf_paths, desc=f"Worker {worker_id} Progress"):
        try:
            # periodically free GPU memory
            torch.cuda.empty_cache()

            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
            print(os.path.basename(pdf_path).replace(".pdf", "").replace('_', '/'))
            pdf_parse_main(pdf_path, output_dir=output_dir)
        except PyPDF2.errors.PdfReadError:
            logger.error(f"{pdf_path} has been broken")
        except Exception as e:
            logger.error(f"{pdf_path} has an error: {e}")


def multiprocessing_setup(pdf_paths, num_gpus, output_dir):
    # number of files handled per GPU
    chunk_size = math.ceil(len(pdf_paths) / num_gpus)
    processes = []

    # one process per GPU
    for gpu_id in range(num_gpus):
        start_idx = gpu_id * chunk_size
        end_idx = min(len(pdf_paths), start_idx + chunk_size)
        chunk = pdf_paths[start_idx:end_idx]

        p = mp.Process(target=init_worker, args=(chunk, gpu_id, output_dir))  # pass output_dir through
        processes.append(p)
        p.start()
        time.sleep(2)

    # wait for all processes to finish
    for p in processes:
        p.join()
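Unlike the floor-division scheme in the older script, this version sizes chunks with `math.ceil`, so no special remainder handling is needed for the last worker; with toy values:

```python
import math

pdf_paths = list(range(10))  # stand-ins for 10 pdf paths
num_gpus = 4
chunk_size = math.ceil(len(pdf_paths) / num_gpus)  # ceil(2.5) == 3

chunks = []
for gpu_id in range(num_gpus):
    start_idx = gpu_id * chunk_size
    # min() guards the final slice against running past the end of the list
    end_idx = min(len(pdf_paths), start_idx + chunk_size)
    chunks.append(pdf_paths[start_idx:end_idx])
```

Ceiling division means earlier workers may carry one extra file, but no worker ever gets a disproportionately large tail.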


if __name__ == "__main__":
    _cur_dir = os.path.dirname(os.path.abspath(__file__))
    # change the paths here
    # pdf_dir = os.path.join(_cur_dir, "二维材料剥离/二维材料剥离/石墨烯")
    # output_dir = os.path.join(_cur_dir, "二维材料剥离/mds/石墨烯")
    # pdf_dir = os.path.join(_cur_dir, "二维材料剥离/二维材料剥离/黑磷烯")
    # output_dir = os.path.join(_cur_dir, "二维材料剥离/mds/黑磷烯")
    pdf_dir = os.path.join(_cur_dir, "模型评估/模型评估")
    output_dir = os.path.join(_cur_dir, "模型评估/mds")
    # pdf_dir = os.path.join(_cur_dir, "金纳米棒/金纳米棒")
    # output_dir = os.path.join(_cur_dir, "金纳米棒/mds")
    # pdf_dir = os.path.join(_cur_dir, "钙钛矿/钙钛矿-复合材料")
    # output_dir = os.path.join(_cur_dir, "钙钛矿/mds/复合材料")
    # pdf_dir = os.path.join(_cur_dir, "钙钛矿/钙钛矿-LAPR/PDF论文")
    # output_dir = os.path.join(_cur_dir, "钙钛矿/mds/LAPR")

    os.makedirs(output_dir, exist_ok=True)
    pdf_paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))
    print("number of pdfs:", len(pdf_paths))

    # skip pdfs whose md file already exists in the output directory
    md_paths = sorted(glob(os.path.join(output_dir, "**", "*.md"), recursive=True))
    md_names = [os.path.basename(md_path) for md_path in md_paths]
    pdf_paths = [pdf_path for pdf_path in pdf_paths if os.path.basename(pdf_path).replace('.pdf', '.md') not in md_names]
    print("number of pdfs after filtering:", len(pdf_paths))
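The resume logic above skips any pdf whose markdown output already exists; the same comparison in isolation (file names are made up):

```python
import os

pdf_paths = ["a/x.pdf", "a/y.pdf", "a/z.pdf"]
md_names = ["x.md", "z.md"]  # markdown files already produced

# keep only pdfs whose expected .md name is not already present
remaining = [p for p in pdf_paths
             if os.path.basename(p).replace(".pdf", ".md") not in md_names]
```

Note this compares base names only, so two pdfs with the same name in different folders would shadow each other.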

    # # number of GPUs
    # num_gpus = 2  # start with 2 GPUs for testing

    # # set the multiprocessing start method
    # mp.set_start_method('spawn', force=True)

    # try:
    #     multiprocessing_setup(pdf_paths, num_gpus, output_dir)
    # except Exception as e:
    #     print(f"program failed: {str(e)}")

    # pdf_path = "black_phosphorus/参考文献/2015.03-ACS Nano-Barbaros Özyilmaz-石墨烯接触、全封装的超薄黑磷基场效应晶体管中的空气稳定传输.pdf"
    for pdf_path in tqdm(pdf_paths):
        pdf_parse_main(pdf_path, output_dir=output_dir)
319  clean/reparagraph.py  Executable file
@@ -0,0 +1,319 @@
"""
|
||||
Author: Yutang LI
|
||||
Institution: SIAT-MIC
|
||||
Contact: yt.li2@siat.ac.cn
|
||||
"""
|
||||
|
||||
import os
|
||||
import re
|
||||
import json
|
||||
from tqdm import tqdm
|
||||
import logging
|
||||
from openai import OpenAI
|
||||
from config import ReparagraphConfig
|
||||
|
||||
# 配置logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('reparagraph.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def get_true_level(title_info: list, config: ReparagraphConfig):
|
||||
source_title = json.dumps(title_info)
|
||||
instruction = """
|
||||
你是一个论文目录重排助手。
|
||||
有如下的JSON格式的目录信息,已知目录中每级标题的内容和行号。
|
||||
<PLACEHOLDER>
|
||||
请你重排该论文的目录层级,并为每级标题的level字段给出正确的层级关系,其中层级关系用数字(1,2,3,4)表示,数字越小,层级越高。
|
||||
注意:重排序目录要求多个1级标题的样式, 而非单一1级目录的样式。也就说level为1的标题数量必须大于1。
|
||||
通常情况下位于一级标题的有可能是:
|
||||
1. 论文的题目
|
||||
2. 论文的摘要(Abstract)
|
||||
3. 论文的介绍(Introduction)
|
||||
4. 论文的方法或实验(Methods or Experiment)
|
||||
5. 论文的结果或讨论(Result or Discussion)
|
||||
6. 论文的结论(Conclusion)
|
||||
7. 论文的参考文献(References)
|
||||
8. 论文的致谢(Acknowledgments)
|
||||
9. 论文的附录(Appendix)
|
||||
10. 论文的支撑信息(Supporting Information)
|
||||
有时候目录中存在序号,这时则优先使用序号顺序重建目录。
|
||||
|
||||
返回结果的时候严格遵守下列示例JSON格式:
|
||||
{ 'data': [
|
||||
{ 'title': 'A hierarchically porous MOF confined CsPbBr3 quantum dots: Fluorescence switching probe for detecting Cu (II) and melamine in food samples', 'line_num': 1, 'level': 1},
|
||||
...
|
||||
]
|
||||
"""
|
||||
# 创建 OpenAI 客户端
|
||||
client = OpenAI(api_key=config.openai_api_key, base_url=config.openai_base_url)
|
||||
messages = [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": instruction.replace("<PLACEHOLDER>", source_title)}
|
||||
]
|
||||
attempt = 0
|
||||
while attempt < config.max_retries:
|
||||
try:
|
||||
completion = client.chat.completions.create(
|
||||
model=config.model_name,
|
||||
stream=False, # 关闭流模式
|
||||
messages=messages,
|
||||
response_format={
|
||||
'type': 'json_object'
|
||||
}
|
||||
)
|
||||
|
||||
response = completion.choices[0].message.content
|
||||
response = json.loads(response)
|
||||
count_level_1 = sum(1 for item in response['data'] if item['level'] == 1)
|
||||
if count_level_1 == 1:
|
||||
attempt += 1
|
||||
messages.append({"role": "assistant", "content": str(response)})
|
||||
messages.append({"role": "user", "content": "上述目录中仅有1个1级标题, 请重新生成目录, 并保证目录中至少有两个1级标题。"})
|
||||
continue
|
||||
return response['data']
|
||||
|
||||
except (json.JSONDecodeError, Exception) as e:
|
||||
logging.error(f"尝试 {attempt + 1}/{config.max_retries} 失败: {str(e)}")
|
||||
if attempt == config.max_retries - 1:
|
||||
logging.error("达到最大重试次数,放弃操作")
|
||||
return "Error"
|
||||
|
||||
|
||||
def read_file_content(file_path: str):
|
||||
"""读取文件内容"""
|
||||
with open(file_path, 'r', encoding='utf-8') as file:
|
||||
return file.readlines()
|
||||
|
||||
def write_file_content(file_path: str, content: list):
|
||||
"""写入文件内容"""
|
||||
with open(file_path, 'w', encoding='utf-8') as file:
|
||||
file.writelines(content)
|
||||
|
||||
def extract_headings(lines: list):
|
||||
"""从文件内容中提取所有以#开头的行及其行号"""
|
||||
headings = []
|
||||
for line_num, line in enumerate(lines, 1):
|
||||
if re.match(r'^#', line.strip()):
|
||||
headings.append((line_num, line.strip()))
|
||||
return headings
|
||||
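A small check of the heading-extraction logic above on an in-memory document (the content is illustrative); the function is self-contained, so it can be exercised directly:

```python
import re

def extract_headings(lines):
    """Collect (line_num, text) for every markdown heading, 1-based line numbers."""
    headings = []
    for line_num, line in enumerate(lines, 1):
        if re.match(r'^#', line.strip()):
            headings.append((line_num, line.strip()))
    return headings

lines = ["# Title\n", "body text\n", "## Methods\n"]
found = extract_headings(lines)
```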

def extract_references(lines: list, headings: list, remove_refs: bool = False):
    """Extract the references section from the file content.

    Args:
        lines: list of file lines
        headings: list of heading info
        remove_refs: whether to blank out the references content

    Returns:
        dict: start, end, and content information
        {
            'start': ref_start,
            'end': ref_end,
            'content': references,
        }
    """
    # look for REFERENCE among the headings
    ref_heading = None
    for line_num, heading in headings:
        if "REFERENCE" in heading.upper().replace(" ", ""):
            ref_heading = (line_num, heading)
            break

    # fall back to the last heading if it is an acknowledgement section
    if not ref_heading and headings and "ACKNOWLEDGEMENT" in heading.upper().replace(" ", ""):
        ref_heading = (line_num, heading)

    if not ref_heading:
        # match common citation formats with a regex and delete them,
        # covering the [n], n., and (n) styles
        ref_pattern = r'^(\[\d+\]|\d+\.|\(\d+\))'
        lines = [line for line in lines if not re.match(ref_pattern, line.strip())]
        return {
            'start': -1,
            'end': -1,
            'content': None
        }, lines

    ref_start = ref_heading[0] - 1  # convert to a 0-based index

    # find the next heading or the end of the file
    ref_end = len(lines)
    for i in range(ref_start + 1, len(lines)):
        if re.match(r'^#', lines[i].strip()):
            ref_end = i
            break

    # extract the references content
    references = ''.join(lines[ref_start:ref_end])

    # blank out the content if requested
    if remove_refs:
        lines[ref_start:ref_end] = []

    return {
        'start': ref_start,
        'end': ref_end,
        'content': references,
    }, lines
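When no References heading is found, the fallback above strips lines that look like citation entries; the pattern on its own, with made-up entries:

```python
import re

# covers [n], n., and (n) citation styles, anchored at the start of the line
ref_pattern = r'^(\[\d+\]|\d+\.|\(\d+\))'

lines = [
    "[12] A. Author, Some Journal, 2020.",
    "Regular paragraph text.",
    "3. B. Author, Another Journal.",
    "(7) C. Author, Letters.",
]
kept = [line for line in lines if not re.match(ref_pattern, line.strip())]
```

One caveat of this heuristic: ordinary numbered lists (`1. step one`) also match `\d+\.` and would be dropped.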

def update_headings(lines: list, heading_data: list):
    """Update the Markdown content according to the supplied heading data."""
    # demote every heading at level 2 or below to bold text
    for heading in heading_data:
        line_num = heading['line_num'] - 1
        if heading['level'] >= 2:
            lines[line_num] = "**" + lines[line_num].replace("#", "").strip() + "**\n"
    return lines
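`update_headings` demotes everything below level 1 to bold text rather than a markdown heading; a minimal run-through with fabricated level data:

```python
def update_headings(lines, heading_data):
    """Demote headings at level >= 2 to bold text, keep level-1 headings as-is."""
    for heading in heading_data:
        line_num = heading['line_num'] - 1  # line numbers in the data are 1-based
        if heading['level'] >= 2:
            lines[line_num] = "**" + lines[line_num].replace("#", "").strip() + "**\n"
    return lines

lines = ["# Paper Title\n", "## Methods\n"]
data = [{"line_num": 1, "level": 1}, {"line_num": 2, "level": 2}]
result = update_headings(lines, data)
```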


def detect_file_encoding(file_path: str):
    """Detect the file encoding."""
    import chardet
    with open(file_path, 'rb') as f:
        raw_data = f.read(1024)
        result = chardet.detect(raw_data)
    return result['encoding']


def process_single_file(file_path: str, config: ReparagraphConfig):
    """Process a single file and return the processed content."""
    # read the file content
    lines = read_file_content(file_path)
    if lines is None:
        return None

    # extract the headings
    headings = extract_headings(lines)

    # extract the references
    ref_info, lines = extract_references(lines, headings, remove_refs=config.remove_refs)
    if ref_info:
        logging.info("extracted references:")
        logging.info(f"start line: {ref_info['start'] + 1}")
        logging.info(f"end line: {ref_info['end']}")
        logging.info("content:")
        logging.info(ref_info['content'])
    else:
        logging.warning("no references section found")

    # deleting the references may shift heading line numbers, so re-index
    headings = extract_headings(lines)
    title_info = [{"title": heading, "line_num": line_num, "level": "unknown"}
                  for line_num, heading in headings]

    new_headings = get_true_level(title_info, config)
    if new_headings == "Error":
        logging.error(f"failed to rebuild headings for {file_path}")
        return None
    updated_lines = update_headings(lines, new_headings)

    logging.info(f"file processed: {file_path}")
    return updated_lines


def create_output_dir(input_path: str, config: ReparagraphConfig):
    """Create the output directory."""
    import os
    from datetime import datetime

    # parent directory of the input path
    parent_dir = os.path.dirname(input_path)

    # timestamped output directory
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = os.path.join(parent_dir, f"{config.task_name}_{timestamp}")
    os.makedirs(output_dir, exist_ok=True)

    return output_dir


def save_processed_file(file_path: str, content: list, output_dir: str, input_path: str):
    """Save the processed file."""
    import os

    # single file input
    if os.path.isfile(input_path):
        output_path = os.path.join(output_dir, os.path.basename(file_path))
    else:
        # preserve the directory structure
        relative_path = os.path.relpath(file_path, input_path)
        output_path = os.path.join(output_dir, relative_path)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.writelines(content)
    logging.info(f"saved processed file: {output_path}")


def reparagraph_file(path: str, config: ReparagraphConfig = None):
    """Process a single file or every .md file in a folder.

    Args:
        path: file path or folder path
        config: ReparagraphConfig instance with the processing options

    Returns:
        str: the output directory path
    """
    import os
    from concurrent.futures import ThreadPoolExecutor

    if config is None:
        config = ReparagraphConfig()

    # create the output directory
    output_dir = create_output_dir(path, config)
    logging.info(f"output directory: {output_dir}")

    # for a folder, recursively collect all .md files
    if os.path.isdir(path):
        files = []
        for root, _, filenames in os.walk(path):
            for filename in filenames:
                if filename.endswith('.md'):
                    files.append(os.path.join(root, filename))
    else:
        files = [path]

    def process_and_save(file_path: str):
        content = process_single_file(file_path, config)
        if content is not None and not config.dry_run:
            save_processed_file(file_path, content, output_dir, path)

    if config.parallel:
        # process in parallel with a thread pool
        with ThreadPoolExecutor() as executor:
            list(tqdm(executor.map(process_and_save, files), total=len(files), desc="Processing files"))
    else:
        # process sequentially
        for file_path in tqdm(files, desc="Processing files"):
            process_and_save(file_path)

    logging.info(f"done, processed {len(files)} files")
    return output_dir
33  clean/step0_pdfs2sql.py  Normal file
@@ -0,0 +1,33 @@
import os
import tqdm
import sqlite3
import mysql.connector


def main():
    cur_path = os.path.dirname(os.path.abspath(__file__))

    TABLE_NAME = 'mp_cif_info'

    mysql_connection = mysql.connector.connect(
        host='100.84.94.73',
        user='metadata_mat_papers',
        password='siat-mic',
        database='metadata_mat_papers'
    )
    mysql_cursor = mysql_connection.cursor()

    pdf_list = os.listdir(os.path.join(cur_path, 'mp_cif/pdfs'))

    doi_list = [pdf.replace('.pdf', '') for pdf in pdf_list]

    try:
        for doi in doi_list:
            sql = f"INSERT INTO {TABLE_NAME} (doi) VALUES (%s)"
            mysql_cursor.execute(sql, (doi,))
            mysql_connection.commit()
    finally:
        mysql_connection.close()


if __name__ == "__main__":
    main()
88  clean/step1_modify_status_with_database.py  Normal file
@@ -0,0 +1,88 @@
import os
import tqdm
import sqlite3
import mysql.connector
import PyPDF2


def read_dois_from_db(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT doi FROM doi_status;")
    dois = [row[0] for row in cursor.fetchall()]
    conn.close()
    return dois


def main():
    cur_path = os.path.dirname(os.path.abspath(__file__))

    TABLE_NAME = 'mp_cif_info'

    mysql_connection = mysql.connector.connect(
        host='100.84.94.73',
        user='metadata_mat_papers',
        password='siat-mic',
        database='metadata_mat_papers'
    )
    mysql_cursor = mysql_connection.cursor()

    try:
        # fetch all dois
        mysql_cursor.execute(f"SELECT doi FROM {TABLE_NAME};")
        dois = [row[0] for row in mysql_cursor.fetchall()]

        for doi in tqdm.tqdm(dois):
            # pdf = doi.replace('/','_').replace('<','_').replace('>','_').replace(':','_') + '.pdf'
            pdf = doi + '.pdf'

            # change this to your pdf path
            pdf_path = os.path.join(cur_path, 'mp_cif/pdfs', pdf)

            if os.path.exists(pdf_path):
                try:
                    # try to open the PDF file
                    with open(pdf_path, 'rb') as file:
                        pdf_reader = PyPDF2.PdfReader(file)  # may raise if the file cannot be parsed

                    # the file opened and parsed successfully: mark the row as 'success'
                    query = f"UPDATE {TABLE_NAME} SET scihub_downloaded = %s WHERE doi = %s"
                    mysql_cursor.execute(query, ('success', doi))
                    mysql_connection.commit()

                except (PyPDF2.errors.PdfReadError, PyPDF2.errors.PdfStreamError):
                    # PDF parsing failed: set scihub_downloaded to NULL
                    query = f"UPDATE {TABLE_NAME} SET scihub_downloaded = %s WHERE doi = %s"
                    mysql_cursor.execute(query, (None, doi))  # None maps to SQL NULL
                    mysql_connection.commit()

                except Exception as e:
                    # any other error
                    print(f"unexpected error while handling PDF {doi}: {e}")
                    query = f"UPDATE {TABLE_NAME} SET scihub_downloaded = %s WHERE doi = %s"
                    mysql_cursor.execute(query, (None, doi))
                    mysql_connection.commit()

    except mysql.connector.Error as error:
        print("Failed to update record in MySQL table: {}".format(error))
        # roll back the transaction on error
        mysql_connection.rollback()

    finally:
        # close the cursor and connection
        mysql_cursor.close()
        mysql_connection.close()


if __name__ == "__main__":
    main()
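The status updates above use parameterized queries (`%s` placeholders with a values tuple) rather than string interpolation, which avoids quoting bugs and SQL injection from odd DOI strings. The same pattern, sketched against an in-memory sqlite database as a stand-in (sqlite uses `?` placeholders; the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mp_cif_info (doi TEXT PRIMARY KEY, scihub_downloaded TEXT)")
cur.execute("INSERT INTO mp_cif_info (doi) VALUES (?)", ("10.1000/demo",))

# parameterized update, mirroring the MySQL version above
cur.execute("UPDATE mp_cif_info SET scihub_downloaded = ? WHERE doi = ?",
            ("success", "10.1000/demo"))
conn.commit()

cur.execute("SELECT scihub_downloaded FROM mp_cif_info WHERE doi = ?", ("10.1000/demo",))
status = cur.fetchone()[0]
conn.close()
```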
47  clean/step2_reserve_success_pdf_with_database.py  Normal file
@@ -0,0 +1,47 @@
import sqlite3
import mysql.connector
import tqdm
import os


TABLE_NAME = 'mp_synthesis_papers_info'
input('TABLE_NAME = {} ?'.format(TABLE_NAME))

cur_dir = os.path.dirname(os.path.abspath(__file__))

# MySQL connection setup
mysql_connection = mysql.connector.connect(
    host='100.84.94.73',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)

try:
    mysql_cursor = mysql_connection.cursor()

    # build the query
    # query = f"SELECT pdf_url FROM {TABLE_NAME} WHERE scihub_downlowded IN ('broken', 'timeout', 'failed') and pdf_url IS NOT NULL;"
    query = f"SELECT pdf_url FROM {TABLE_NAME} WHERE scihub_downlowded IS NULL AND pdf_url IS NOT NULL;"
    mysql_cursor.execute(query)
    records = mysql_cursor.fetchall()

    for record in tqdm.tqdm(records):
        # pdf_path = os.path.join(cur_dir, record[0])
        # if os.path.exists(pdf_path):
        #     os.remove(pdf_path)
        query = f"UPDATE {TABLE_NAME} SET pdf_url = NULL WHERE pdf_url = '{record[0]}';"
        mysql_cursor.execute(query)
        mysql_connection.commit()

    # commit the changes to the database
    mysql_connection.commit()

except mysql.connector.Error as error:
    print("Failed to update record in MySQL table: {}".format(error))
    # roll back the transaction on error
    mysql_connection.rollback()

finally:
    # close the cursor and connection
    mysql_cursor.close()
    mysql_connection.close()
52  clean/step3_path_change_with_database.py  Normal file
@@ -0,0 +1,52 @@
import sqlite3
import mysql.connector
import tqdm
import os


TABLE_NAME = 'mp_cif_info'
input('TABLE_NAME = {} ?'.format(TABLE_NAME))

cur_dir = os.path.dirname(os.path.abspath(__file__))

# MySQL connection setup
mysql_connection = mysql.connector.connect(
    host='100.84.94.73',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)

try:
    mysql_cursor = mysql_connection.cursor()

    # fetch every doi whose download status is 'success'
    query = f"SELECT doi, pdf_url FROM {TABLE_NAME} WHERE scihub_downloaded = 'success';"
    mysql_cursor.execute(query)
    results = mysql_cursor.fetchall()
    dois = [row[0] for row in results]
    pdf_urls = [row[1] for row in results]

    for doi, pdf_url in tqdm.tqdm(zip(dois, pdf_urls), total=len(dois)):
        # skip rows that have already been updated
        if pdf_url is not None and pdf_url.split('/')[0] == 'mp_cif' and pdf_url.split('/')[1] == 'pdfs':
            continue
        # pdf = doi.replace('/','_').replace('<','_').replace('>','_').replace(':','_') + '.pdf'
        pdf = doi + '.pdf'
        # the new path
        pdf_path = os.path.join('mp_cif/pdfs', pdf)
        query = f"UPDATE {TABLE_NAME} SET pdf_url = '{pdf_path}' WHERE doi = '{doi}';"
        mysql_cursor.execute(query)
        mysql_connection.commit()

    # commit the changes to the database
    mysql_connection.commit()

except mysql.connector.Error as error:
    print("Failed to update record in MySQL table: {}".format(error))
    # roll back the transaction on error
    mysql_connection.rollback()

finally:
    # close the cursor and connection
    mysql_cursor.close()
    mysql_connection.close()
51  clean/step4.2_modify_md_with_database.py  Normal file
@@ -0,0 +1,51 @@
import mysql.connector
|
||||
import tqdm
|
||||
import os
|
||||
|
||||
TABLE_NAME = 'phosphorus_synthesis_info_new'
|
||||
input('TABLE_NAME = {} ?'.format(TABLE_NAME))
|
||||
|
||||
cur_dir = os.path.dirname(os.path.abspath(__file__))
|
||||
|
||||
# MySQL connection setup
|
||||
mysql_connection = mysql.connector.connect(
|
||||
host='100.84.94.73',
|
||||
user='metadata_mat_papers',
|
||||
password='siat-mic',
|
||||
database='metadata_mat_papers'
|
||||
)
|
||||
|
||||
try:
|
||||
mysql_cursor = mysql_connection.cursor()
|
||||
|
||||
# 获取所有已转换的 doi
|
||||
query = f"SELECT doi, md_url FROM {TABLE_NAME} WHERE en_text_content IS NOT NULL;"
|
||||
mysql_cursor.execute(query)
|
||||
results = mysql_cursor.fetchall()
|
||||
dois = [row[0] for row in results]
|
||||
md_urls = [row[1] for row in results]
|
||||
|
||||
for doi, md_url in tqdm.tqdm(zip(dois, md_urls), total=len(dois)):
|
||||
# 若是已经修改过的,则直接跳过
|
||||
dir_name = 'phosphorus'
|
||||
if md_url is not None and md_url.split('/')[0] == dir_name and md_url.split('/')[1] == 'mds':
|
||||
continue
|
||||
md_name = doi.replace('/','_').replace('<','_').replace('>','_').replace(':','_')
|
||||
md = md_name + '.md'
|
||||
md_path = os.path.join(dir_name+'/mds', md_name, md)
|
||||
query = f"UPDATE {TABLE_NAME} SET md_url = '{md_path}', convert2md = 'success' WHERE doi = '{doi}';"
|
||||
mysql_cursor.execute(query)
|
||||
mysql_connection.commit()
|
||||
|
||||
# 提交更改到数据库
|
||||
mysql_connection.commit()
|
||||
|
||||
except mysql.connector.Error as error:
|
||||
print("Failed to insert record into MySQL table: {}".format(error))
|
||||
# 如果发生错误,撤回事务
|
||||
mysql_connection.rollback()
|
||||
|
||||
finally:
|
||||
# 关闭游标和连接
|
||||
mysql_cursor.close()
|
||||
mysql_connection.close()
|
||||
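A hedged sketch of the filename convention used above: a DOI becomes a file name by replacing the characters `/`, `<`, `>` and `:` with underscores. Note that the mapping is lossy: the step4 script later recovers the DOI with `replace('_', '/')`, which is only correct when the DOI itself contains no underscores. `doi_to_md_name` is an illustrative name, not one from the scripts.

```python
def doi_to_md_name(doi: str) -> str:
    # Mirror of the replace chain above; these characters are not allowed
    # in file names on common filesystems.
    for ch in '/<>:':
        doi = doi.replace(ch, '_')
    return doi
```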
424  clean/step4_preprocess_mineru_multi_with_database.py  Normal file
@@ -0,0 +1,424 @@
import re
import os
import json
import copy
import requests
import time
import shutil
import uuid
import sqlite3
import PyPDF2
import multiprocessing
import mysql.connector
from concurrent.futures import ThreadPoolExecutor, as_completed

from loguru import logger
from glob import glob
from tqdm import tqdm
from datetime import datetime
import asyncio

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config

model_config.__use_inside_model__ = True


# Image-hosting (imgbed) configuration
# IMGBED_URL = "http://localhost:40027/"
IMGBED_URL = "http://172.20.103.171:40027/"
# Make sure the imgbed URL ends with a slash
if not IMGBED_URL.endswith('/'):
    IMGBED_URL += '/'
token_endpoint = f"{IMGBED_URL}api/v1/tokens"
upload_endpoint = f"{IMGBED_URL}api/v1/upload"

# Obtain a token with:
# curl -X POST http://localhost:40027/api/v1/tokens -H "Content-Type: application/json" -d '{"email":"yt.li2@siat.ac.cn", "password":"lyt20000414."}'
IMGBED_TOKEN = "6|QsBh5H7txY3Hd7ju1nzYKOBSdFQeL0YberydSFIH"


def replace_image_links(md_content: str, images_urls: dict) -> str:
    # Match Markdown image links of the form ![alt](path)
    pattern = r'!\[(.*?)\]\((.*?)\)'

    def replace_link(match):
        # Extract the image path from the current match
        image_path = match.group(2)
        # Check whether the path is in the mapping
        if image_path in images_urls:
            # Look up the new URL in the mapping
            new_url = images_urls[image_path]
            return f"![{match.group(1)}]({new_url})"
        return match.group(0)

    # Perform the substitution
    updated_md_content = re.sub(pattern, replace_link, md_content)
    return updated_md_content


# Upload images to LSKY Pro
def upload_image(img_dir):
    headers = {
        "Authorization": f"Bearer {IMGBED_TOKEN}",
        'Accept': 'application/json'
    }

    image_urls = {}
    img_names = os.listdir(img_dir)
    for image_name in img_names:
        retry = 0
        image_path = os.path.join(img_dir, image_name)
        while retry < 5:  # maximum number of retries
            try:
                with open(image_path, 'rb') as image_file:  # keep the file open while uploading
                    files = {'file': image_file}

                    # Upload the file
                    response = requests.post(upload_endpoint, headers=headers, files=files)
                    if response.status_code == 200:
                        result = response.json()
                        if result['status']:
                            image_url = result['data']['links']['url']
                            image_urls['images/'+image_name] = image_url
                            print(f"Image uploaded successfully: {image_url}")
                            break  # upload succeeded, leave the retry loop
                        else:
                            raise Exception(f"Image upload failed: {result['message']}")
                    elif response.status_code == 429:
                        # 429 response: wait a while before retrying
                        wait_time = 3
                        # wait_time = min(2 ** retry, 10)  # exponential backoff, max 10 s
                        print(f"Rate limited, waiting {wait_time} s...")
                        time.sleep(wait_time)
                    else:
                        raise Exception(f"HTTP request failed: {response.status_code}")

                retry += 1  # increase the retry count
                time.sleep(1)  # short pause after a failed attempt

            except FileNotFoundError:
                logger.error(f"File {image_path} does not exist, please check the path")
                return image_urls  # return what has been collected so far instead of None

    return image_urls


# Save images locally, ensuring the generated file names are unique
def save_images_locally(img_dir, target_dir):
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    image_urls = {}

    img_names = os.listdir(img_dir)

    # Copy each image into the target directory
    for image_name in img_names:
        image_path = os.path.join(img_dir, image_name)

        # Use a UUID to build a unique file name, keeping the original extension
        unique_name = f"{uuid.uuid4()}{os.path.splitext(image_name)[1]}"
        save_path = os.path.join(target_dir, unique_name)

        try:
            # Copy the file into the target directory
            shutil.copy2(image_path, save_path)
            # Record the image name together with its saved path
            image_urls[f'images/{unique_name}'] = save_path
            print(f"Image saved successfully: {save_path}")
        except FileNotFoundError:
            print(f"File {image_path} does not exist, skipping this image")
        except Exception as e:
            print(f"Error while saving image {image_name}: {e}")

    return image_urls


def json_md_dump(
    pipe,
    md_writer,
    pdf_name,
    content_list,
    md_content,
):
    # Write the model output to model.json
    orig_model_list = copy.deepcopy(pipe.model_list)
    md_writer.write(
        content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_model.json"
    )

    # Write the intermediate results to middle.json
    md_writer.write(
        content=json.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_middle.json"
    )

    # Write the text results to content_list.json
    md_writer.write(
        content=json.dumps(content_list, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_content_list.json"
    )

    # Write the result to the .md file
    md_writer.write(
        content=md_content,
        path=f"{pdf_name}.md"
    )


def pdf_parse_main(
    pdf_path: str,
    parse_method: str = 'auto',
    model_json_path: str = None,
    is_json_md_dump: bool = True,
    output_dir: str = None
):
    """
    Convert a PDF to JSON and Markdown; the .md and .json files are written to the PDF's directory.

    :param pdf_path: path to the .pdf file, relative or absolute
    :param parse_method: parsing method, one of auto / ocr / txt; defaults to auto; try ocr if results are poor
    :param model_json_path: path to an existing model-data file; if empty the built-in model is used; the PDF and model_json must correspond
    :param is_json_md_dump: whether to write the parsed data to .json and .md files, default True; data from different stages goes to three separate .json files, and the Markdown content to a .md file
    :param output_dir: output directory; a folder named after the PDF file is created to hold all results
    """
    try:
        pdf_name = os.path.basename(pdf_path).split("/")[-1].replace(".pdf", "")
        pdf_path_parent = os.path.dirname(pdf_path)

        if output_dir:
            output_path = os.path.join(output_dir, pdf_name)
        else:
            output_path = os.path.join(pdf_path_parent, pdf_name)

        output_image_path = os.path.join(output_path, 'images')

        # Parent path of the images, so relative paths get stored in the .md and content_list.json files
        image_path_parent = os.path.basename(output_image_path)

        pdf_bytes = open(pdf_path, "rb").read()  # read the binary content of the PDF

        if model_json_path:
            # Load the raw JSON (a list) produced by a previous model run over this PDF
            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
        else:
            model_json = []

        # Run the parsing steps
        # image_writer = DiskReaderWriter(output_image_path)
        image_writer, md_writer = DiskReaderWriter(output_image_path), DiskReaderWriter(output_path)

        # Choose the parsing pipeline
        # jso_useful_key = {"_pdf_type": "", "model_list": model_json}
        # pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        if parse_method == "auto":
            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        elif parse_method == "txt":
            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
        elif parse_method == "ocr":
            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
        else:
            logger.error("unknown parse method, only auto, ocr, txt allowed")
            exit(1)

        # Classify the document
        pipe.pipe_classify()

        # If no model data was supplied, analyse with the built-in model
        if not model_json:
            if model_config.__use_inside_model__:
                pipe.pipe_analyze()  # parse
            else:
                logger.error("need model list input")
                exit(1)

        # Run the parser
        pipe.pipe_parse()

        # Produce the text and Markdown results
        content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
        md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
        # Upload the images to the image host
        # image_urls = upload_image(output_image_path)
        # Save the images locally
        target_dir = "mp_cif/images"
        image_urls = save_images_locally(output_image_path, target_dir)
        md_content = replace_image_links(md_content, image_urls)

        mysql_connection = mysql.connector.connect(
            host='100.84.94.73',
            user='metadata_mat_papers',
            password='siat-mic',
            database='metadata_mat_papers',
            charset="utf8mb4",  # use utf8mb4 for the connection
            collation="utf8mb4_unicode_ci"  # use an appropriate collation
        )
        mysql_cursor = mysql_connection.cursor()

        table = 'mp_cif_info'
        # path = 'phosphorus/pdfs/' + pdf_name + '.pdf'
        # print("path:", path)
        doi = os.path.basename(pdf_path).replace(".pdf", "").replace('_', '/')

        try:
            # Build the query
            query = f"UPDATE {table} SET en_text_content = %s WHERE doi = %s"
            mysql_cursor.execute(query, (md_content, doi))
            print(f"{doi}: Markdown saved successfully")

            # Commit the changes to the database
            mysql_connection.commit()

        except mysql.connector.Error as error:
            print("Failed to insert record into MySQL table: {}".format(error))
            # Roll back the transaction on error
            mysql_connection.rollback()

        finally:
            # Close the cursor and the connection
            mysql_cursor.close()
            mysql_connection.close()

        if is_json_md_dump:
            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
        return 'success'

    except Exception as e:
        logger.exception(e)
        return 'error'


def check_doi_not_in_db(pdf_name, cursor):
    query = f"SELECT * FROM doi_status WHERE doi = ? AND convert_status = ? "
    cursor.execute(query, (pdf_name, 'unprocessed'))
    res = cursor.fetchone()
    if res:
        return True
    else:
        return False


def init_worker(devices, pdfs, gpu_index, process_id):
    """
    Initialize a worker process to process a chunk of PDFs with a specific GPU.
    """
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)
    process_pdf_chunk(pdfs, gpu_index, process_id)


def get_converted2md_dois():
    table = 'mp_cif_info'

    dois = []

    mysql_connection = mysql.connector.connect(
        host='100.84.94.73',
        user='metadata_mat_papers',
        password='siat-mic',
        database='metadata_mat_papers',
        charset="utf8mb4",  # use utf8mb4 for the connection
        collation="utf8mb4_unicode_ci"  # use an appropriate collation
    )
    mysql_cursor = mysql_connection.cursor()

    try:
        # Build the query
        query = f"SELECT doi FROM {table} WHERE en_text_content IS NOT NULL;"
        mysql_cursor.execute(query)
        res = mysql_cursor.fetchall()
        dois = [row[0] for row in res if row]
    except mysql.connector.Error as error:
        # Roll back the transaction on error
        mysql_connection.rollback()
    finally:
        # Close the cursor and the connection
        mysql_cursor.close()
        mysql_connection.close()
    return dois


def is_within_operational_hours(start_hour, end_hour):
    now = datetime.now().time()  # current time (without the date)
    current_hour = now.hour  # current hour

    # Check whether the window crosses midnight (e.g. evening to the next morning)
    if start_hour > end_hour:
        return (current_hour >= start_hour or current_hour < end_hour)  # spans midnight
    else:
        return start_hour <= current_hour < end_hour


def process_pdf_chunk(pdf_paths, gpu_index, process_id):
    for pdf_path in tqdm(pdf_paths, desc=f"Worker {gpu_index}_{process_id} Progress"):
        # Only run the task inside the allowed time window
        start_hour = 15  # 15:00 (3 pm)
        end_hour = 9  # 09:00 the next morning

        # Check whether the current time is inside the allowed window
        while True:
            if is_within_operational_hours(start_hour, end_hour):
                print("Current time is inside the task window, processing PDF files...")
                try:
                    with open(pdf_path, 'rb') as file:
                        pdf_reader = PyPDF2.PdfReader(file)
                        print(os.path.basename(pdf_path).replace(".pdf", "").replace('_', '/'))
                        status = pdf_parse_main(pdf_path, parse_method='auto', output_dir=output_dir)
                        break  # finished, leave the loop
                except PyPDF2.errors.PdfReadError:
                    logger.error(f"{pdf_path} has been broken")
                    break  # error, leave the loop
                except Exception as e:
                    logger.error(f"{pdf_path} has an error: {e}")
                    break  # error, leave the loop
            else:
                # Current time is outside the allowed window, block the task
                print("Current time is outside the run window, retrying later...")
                time.sleep(60 * 60)  # sleep for an hour, then check again


def multiprocessing_setup(pdf_paths, num_gpus):
    num_processes_per_gpu = 3
    chunk_size = len(pdf_paths) // (num_gpus * num_processes_per_gpu)
    processes = []

    # Create processes for each GPU
    for gpu_id in range(num_gpus):
        for process_id in range(num_processes_per_gpu):
            start_idx = (gpu_id * num_processes_per_gpu + process_id) * chunk_size
            end_idx = None if (gpu_id == num_gpus - 1 and process_id == num_processes_per_gpu - 1) else start_idx + chunk_size
            chunk = pdf_paths[start_idx:end_idx]

            p = multiprocessing.Process(target=init_worker, args=([gpu_id], chunk, gpu_id, process_id))
            processes.append(p)
            p.start()

    # Ensure all processes have completed
    for p in processes:
        p.join()


if __name__ == '__main__':
    _cur_dir = os.path.dirname(os.path.abspath(__file__))
    # Change the paths here
    pdf_dir = os.path.join(_cur_dir, "mp_cif/pdfs")
    output_dir = os.path.join(_cur_dir, "mp_cif/mds")

    os.makedirs(output_dir, exist_ok=True)
    pdf_paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))

    dois = get_converted2md_dois()
    print(len(dois))
    new_pdf_paths = pdf_paths[:]
    for path in tqdm(pdf_paths):
        doi = os.path.basename(path).replace(".pdf", "").replace('_', '/')
        if doi in dois:
            new_pdf_paths.remove(path)
    print(len(new_pdf_paths))

    # Number of GPUs
    num_gpus = 8

    # Setup multiprocessing to handle PDFs across multiple GPUs
    multiprocessing_setup(new_pdf_paths, num_gpus)
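A pure-Python sketch of the index arithmetic in `multiprocessing_setup` above: the PDF list is cut into `num_gpus * num_processes_per_gpu` slices, and the last worker's slice is left open-ended so the remainder of the integer division is not lost. `make_chunks` is an illustrative name.

```python
def make_chunks(items, num_gpus, procs_per_gpu):
    # Same slicing as multiprocessing_setup, without spawning processes.
    chunk_size = len(items) // (num_gpus * procs_per_gpu)
    chunks = []
    for gpu_id in range(num_gpus):
        for pid in range(procs_per_gpu):
            start = (gpu_id * procs_per_gpu + pid) * chunk_size
            # The very last chunk runs to the end of the list.
            end = None if (gpu_id == num_gpus - 1 and pid == procs_per_gpu - 1) else start + chunk_size
            chunks.append(items[start:end])
    return chunks
```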
160  clean/stp1_bib2sql.py  Normal file
@@ -0,0 +1,160 @@
import os
import glob
import mysql.connector
import bibtexparser
import tqdm


TABLE_NAME = 'phosphorus_synthesis_info'
input('Are you sure TABLE_NAME is {}?'.format(TABLE_NAME))

# phosphorus_synthesis
bibs_dir = os.path.join(os.path.dirname(__file__), 'synthesis23-25')
if_file_path = os.path.join(os.path.dirname(__file__), '2023JCR.xlsx')
input('Are you sure the import directory is {}?'.format(bibs_dir))

# MySQL connection setup
connection = mysql.connector.connect(
    host='localhost',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)
cursor = connection.cursor()


# Function to check if a table exists
def check_table_exists(table_name):
    cursor.execute(f"""
        SELECT COUNT(*)
        FROM information_schema.tables
        WHERE table_schema = DATABASE()
        AND table_name = '{table_name}'
    """)
    return cursor.fetchone()[0] == 1


# Function to create the table if it doesn't exist
def create_table(table_name):
    if not check_table_exists(table_name):
        query = f"""
        CREATE TABLE IF NOT EXISTS `{table_name}` (
            doi VARCHAR(255) PRIMARY KEY,
            unique_id VARCHAR(255),
            author TEXT,
            title TEXT,
            journal VARCHAR(255),
            year INT,
            volume VARCHAR(50),
            number VARCHAR(50),
            pages VARCHAR(50),
            month VARCHAR(50),
            issn VARCHAR(50),
            eissn VARCHAR(50),
            researcher_id TEXT,
            if2023 VARCHAR(50),
            if5 VARCHAR(50),
            journal_index VARCHAR(50),
            jcr_quartile VARCHAR(50),
            orcid TEXT,
            early_access_date VARCHAR(50),
            scihub_downlowded VARCHAR(50),
            convert2md VARCHAR(50),
            pdf_url TEXT,
            md_url TEXT,
            abstract TEXT,
            image_url JSON,
            text_content LONGTEXT
        );
        """
        cursor.execute(query)


def record_exists(doi, table_name):
    query = f"SELECT COUNT(*) FROM `{table_name}` WHERE doi = %s"
    cursor.execute(query, (doi,))
    count = cursor.fetchone()[0]
    return count > 0


# Function to insert a record into the MySQL database
def insert_record(entry, table_name):
    # Column name list
    columns = [
        'doi', 'unique_id', 'author', 'title', 'journal', 'year', 'volume',
        'number', 'pages', 'month', 'issn', 'eissn', 'researcher_id', 'if2023', 'if5', 'journal_index', 'jcr_quartile',
        'orcid', 'early_access_date', 'scihub_downlowded', 'convert2md', 'pdf_url', 'md_url', 'abstract', 'image_url', 'text_content'
    ]

    # Build the SQL query
    placeholders = ', '.join(['%s'] * len(columns))
    query = f"""
        INSERT INTO `{table_name}` ({', '.join(columns)})
        VALUES ({placeholders})
    """

    values = (
        entry.get('doi'),
        entry.get('unique-id'),
        entry.get('author'),
        entry.get('title'),
        entry.get('journal'),
        entry.get('year'),
        entry.get('volume'),
        entry.get('number', None),
        entry.get('pages', None),
        entry.get('month', None),
        entry.get('issn', None),
        entry.get('eissn', None),
        entry.get('researcherid-numbers', None),
        entry.get('if2023', None),
        entry.get('if5', None),
        entry.get('journal_index', None),
        entry.get('jcr_quartile', None),
        entry.get('ocrid-numbers', None),
        entry.get('earlyaccessdate', None),
        entry.get('scihub_downlowded', None),
        entry.get('convert2md', None),
        entry.get('pdf_url', None),
        entry.get('md_url', None),
        entry.get('abstract', None),
        entry.get('image_url', None),
        entry.get('text_content', None)
    )
    cursor.execute(query, values)


# Open the Excel file with pandas
import pandas as pd
df = pd.read_excel(if_file_path)
# Replace all NaN values with None
df = df.replace({pd.NA: None})

# Create the table if it doesn't exist
create_table(TABLE_NAME)

bib_files = sorted(glob.glob(os.path.join(bibs_dir, '*.bib')))
for bib_file in tqdm.tqdm(bib_files):
    # Read and parse the .bib file
    with open(bib_file, 'r') as bibtex_file:
        bib_database = bibtexparser.load(bibtex_file)
    for entry in bib_database.entries:
        entry = {k.lower(): v for k, v in entry.items()}
        journal = entry.get('journal')
        if journal is not None:
            journal_lower = journal.lower()  # lower-case the journal name for case-insensitive matching
            matching_journal = df[df['JournalName'].str.lower() == journal_lower]  # look the journal up in the DataFrame
            if not matching_journal.empty:
                entry['if2023'] = matching_journal['IF2023'].values[0]
                entry['if5'] = matching_journal['IF5'].values[0]
                entry['journal_index'] = matching_journal['INDEX'].values[0]
                entry['jcr_quartile'] = matching_journal['Quartile'].values[0]

        doi = entry.get('doi')
        # Check that the record does not already exist and that the DOI is not empty
        if not record_exists(doi, TABLE_NAME) and doi is not None:
            insert_record(entry, TABLE_NAME)

# Commit the changes and close the connection
connection.commit()
cursor.close()
connection.close()
print("Data has been inserted into the database!")
193  clean/stp1_excel2sql.py  Normal file
@@ -0,0 +1,193 @@
import os
import mysql.connector


TABLE_NAME = 'crispr_papers_info'
input('Are you sure TABLE_NAME is {}?'.format(TABLE_NAME))

# phosphorus_synthesis
excels_dir = os.path.join(os.path.dirname(__file__), 'CRISPR/CRISPR_engineered')
if_file_path = os.path.join(os.path.dirname(__file__), 'CRISPR/2023JCR.xlsx')
input('Are you sure the import directory is {}?'.format(excels_dir))

# MySQL connection setup
connection = mysql.connector.connect(
    host='100.84.94.73',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)
cursor = connection.cursor()


# Function to check if a table exists
def check_table_exists(table_name):
    cursor.execute(f"""
        SELECT COUNT(*)
        FROM information_schema.tables
        WHERE table_schema = DATABASE()
        AND table_name = '{table_name}'
    """)
    return cursor.fetchone()[0] == 1


# Function to create the table if it doesn't exist
def create_table(table_name):
    if not check_table_exists(table_name):
        query = f"""
        CREATE TABLE IF NOT EXISTS `{table_name}` (
            doi VARCHAR(255) PRIMARY KEY,
            unique_id VARCHAR(255),
            author TEXT,
            title TEXT,
            journal VARCHAR(255),
            year INT,
            volume VARCHAR(50),
            number VARCHAR(50),
            pages VARCHAR(50),
            month VARCHAR(50),
            issn VARCHAR(50),
            eissn VARCHAR(50),
            researcher_id TEXT,
            if2023 VARCHAR(50),
            if5 VARCHAR(50),
            journal_index VARCHAR(50),
            jcr_quartile VARCHAR(50),
            orcid TEXT,
            early_access_date VARCHAR(50),
            scihub_downlowded VARCHAR(50),
            convert2md VARCHAR(50),
            pdf_url TEXT,
            md_url TEXT,
            abstract TEXT,
            image_url JSON,
            en_text_content LONGTEXT,
            cited_reference_count INT,
            doi_link TEXT,
            research_areas TEXT,
            unique_wos_id VARCHAR(255)
        );
        """
        cursor.execute(query)


def record_exists(doi, table_name):
    query = f"SELECT COUNT(*) FROM `{table_name}` WHERE doi = %s"
    cursor.execute(query, (doi,))
    count = cursor.fetchone()[0]
    return count > 0


# Function to insert a record into the MySQL database
def insert_record(entry, table_name):
    # Column name list ('en_text_content' here matches the column defined in create_table)
    columns = [
        'doi', 'unique_id', 'author', 'title', 'journal', 'year', 'volume',
        'number', 'pages', 'month', 'issn', 'eissn', 'researcher_id', 'if2023', 'if5', 'journal_index', 'jcr_quartile',
        'orcid', 'early_access_date', 'scihub_downlowded', 'convert2md', 'pdf_url', 'md_url', 'abstract', 'image_url',
        'en_text_content', 'cited_reference_count', 'doi_link', 'research_areas', 'unique_wos_id'
    ]

    # Build the SQL query
    placeholders = ', '.join(['%s'] * len(columns))
    query = f"""
        INSERT INTO `{table_name}` ({', '.join(columns)})
        VALUES ({placeholders})
    """

    values = (
        entry.get('doi'),
        entry.get('unique-id'),
        entry.get('author'),
        entry.get('title'),
        entry.get('journal'),
        entry.get('year'),
        entry.get('volume'),
        entry.get('number', None),
        entry.get('pages', None),
        entry.get('month', None),
        entry.get('issn', None),
        entry.get('eissn', None),
        entry.get('researcherid-numbers', None),
        entry.get('if2023', None),
        entry.get('if5', None),
        entry.get('journal_index', None),
        entry.get('jcr_quartile', None),
        entry.get('ocrid-numbers', None),
        entry.get('earlyaccessdate', None),
        entry.get('scihub_downlowded', None),
        entry.get('convert2md', None),
        entry.get('pdf_url', None),
        entry.get('md_url', None),
        entry.get('abstract', None),
        entry.get('image_url', None),
        entry.get('text_content', None),
        entry.get('cited_reference_count', None),
        entry.get('doi_link', None),
        entry.get('research_areas', None),
        entry.get('unique_wos_id', None)
    )
    cursor.execute(query, values)


# Open the Excel file with pandas
import pandas as pd

df = pd.read_excel(if_file_path)
# Replace all NaN values with None
df = df.replace({pd.NA: None})

# Create the table if it doesn't exist
create_table(TABLE_NAME)

excels_file_list = []
for file in os.listdir(excels_dir):  # os.listdir('溶剂热文献-230505-swx-V3')
    if file.endswith('.xls'):
        excels_file_list.append(os.path.splitext(file)[0])


for excels_file in excels_file_list:
    print(os.path.join(excels_dir, excels_file + '.xls'))

    # Path to the Excel file
    file_path = os.path.join(excels_dir, excels_file + '.xls')

    # Read the Excel file
    excel_df = pd.read_excel(file_path)
    # Replace all NaN values with None
    excel_df = excel_df.replace({pd.NA: None})

    # Show the first few rows of the DataFrame
    # print(df.head(5))
    for i in range(len(excel_df)):
        entry = dict()
        entry['doi'] = str(excel_df.loc[i, 'DOI'])
        entry['title'] = str(excel_df.loc[i, 'Article Title'])
        entry['journal'] = str(excel_df.loc[i, 'Source Title'])
        entry['abstract'] = str(excel_df.loc[i, 'Abstract'])
        entry['cited_reference_count'] = int(excel_df.loc[i, 'Cited Reference Count'])
        entry['year'] = int(excel_df.loc[i, 'Publication Year'])
        entry['doi_link'] = str(excel_df.loc[i, 'DOI Link'])
        entry['research_areas'] = str(excel_df.loc[i, 'Research Areas'])
        entry['unique_wos_id'] = str(excel_df.loc[i, 'UT (Unique WOS ID)'])

        journal = entry.get('journal')
        if journal is not None:
            journal_lower = journal.lower()  # lower-case the journal name for case-insensitive matching
            matching_journal = df[df['JournalName'].str.lower() == journal_lower]  # look the journal up in the DataFrame
            if not matching_journal.empty:
                entry['if2023'] = matching_journal['IF2023'].values[0]
                entry['if5'] = matching_journal['IF5'].values[0]
                entry['journal_index'] = matching_journal['INDEX'].values[0]
                entry['jcr_quartile'] = matching_journal['Quartile'].values[0]

        doi = entry.get('doi')
        # Check that the record does not already exist and that the DOI is not empty
        if not record_exists(doi, TABLE_NAME) and doi is not None:
            insert_record(entry, TABLE_NAME)

# Commit the changes and close the connection
connection.commit()
cursor.close()
connection.close()
print("Data has been inserted into the database!")
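An optional sketch: both stp1 scripts do a case-insensitive journal lookup by lower-casing the journal name on every comparison against the JCR DataFrame. Building a dict keyed on the lower-cased journal name once would make each lookup O(1). The names and the sample row below are illustrative, not from the 2023JCR data.

```python
def build_if_index(rows):
    # rows: iterable of (journal_name, if2023) pairs
    return {name.lower(): if2023 for name, if2023 in rows}

if_index = build_if_index([("Chemical Engineering Journal", "13.2")])
hit = if_index.get("Chemical Engineering Journal".lower())
```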
65  clean/stp2.1_migrate_download_sqlite2mysql.py  Normal file
@@ -0,0 +1,65 @@
# This script migrates data from an SQLite database into MySQL.
# It targets code written during the SQLite stage; once you operate on MySQL directly, do not use this script.

import sqlite3
import mysql.connector

TABLE_NAME = 'phosphorus_synthesis_info'
input('Are you sure TABLE_NAME is {}?'.format(TABLE_NAME))

# SQLite setup
sqlite_connection = sqlite3.connect('/home/ubuntu/workplace/LYT/llm-agent/phosphorus/doi_status.db')  # Ensure this is your actual SQLite database file
sqlite_cursor = sqlite_connection.cursor()

# MySQL connection setup
mysql_connection = mysql.connector.connect(
    host='100.84.94.73',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)
mysql_cursor = mysql_connection.cursor()

# Define the SQLite query to retrieve data
sqlite_query = "SELECT doi, status, pdf_url FROM doi_status"  # Ensure these field names match your SQLite table

# Function to check if a record exists in the MySQL database
def record_exists(doi, table_name):
    query = f"SELECT COUNT(*) FROM `{table_name}` WHERE doi = %s"
    mysql_cursor.execute(query, (doi,))
    count = mysql_cursor.fetchone()[0]
    return count > 0

# Function to update a record in the MySQL database
def update_record(doi, scihub_downlowded, pdf_url, table_name):
    query = f"""
        UPDATE `{table_name}`
        SET scihub_downlowded = %s, pdf_url = %s
        WHERE doi = %s
    """
    mysql_cursor.execute(query, (scihub_downlowded, pdf_url, doi))

# Fetch data from SQLite
sqlite_cursor.execute(sqlite_query)
rows = sqlite_cursor.fetchall()

# Iterate over SQLite rows and update MySQL records
for row in rows:
    doi, scihub_downlowded, pdf_url = row
    if record_exists(doi, TABLE_NAME):  # Replace with your actual MySQL table name
        update_record(doi, scihub_downlowded, pdf_url, TABLE_NAME)  # Adjust table name if necessary
    else:
        # You can choose to handle non-existent DOI entries differently if necessary
        print(f"Record with DOI {doi} does not exist in MySQL database.")


# Commit the changes to the MySQL database
mysql_connection.commit()

# Close connections
sqlite_cursor.close()
sqlite_connection.close()
mysql_cursor.close()
mysql_connection.close()

print("Data migration from SQLite to MySQL completed successfully!")
28  clean/stp2.2_remove_broken_pdf.py  Normal file
@@ -0,0 +1,28 @@
import sqlite3
import mysql.connector
import tqdm
import os

TABLE_NAME = 'phosphorus_synthesis_info'
input('TABLE_NAME = {} ?'.format(TABLE_NAME))

cur_dir = os.path.dirname(os.path.abspath(__file__))

# MySQL connection setup
mysql_connection = mysql.connector.connect(
    host='100.84.94.73',
    user='metadata_mat_papers',
    password='siat-mic',
    database='metadata_mat_papers'
)
mysql_cursor = mysql_connection.cursor()


# Build the query
query = f"SELECT pdf_url FROM {TABLE_NAME} WHERE scihub_downlowded = 'broken'"
mysql_cursor.execute(query)
records = mysql_cursor.fetchall()

for record in tqdm.tqdm(records):
    pdf_path = os.path.join(cur_dir, record[0])
    os.remove(pdf_path)
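A hedged variant of the cleanup loop above, not part of the script: `os.remove` raises `FileNotFoundError` if a 'broken' PDF was already deleted on a previous run, which aborts the whole loop. Guarding on `os.path.isfile` keeps the cleanup idempotent. `remove_if_exists` is an illustrative name.

```python
import os

def remove_if_exists(path: str) -> bool:
    # Delete the file only if it is still present; report whether anything was removed.
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```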
211  clean/stp2_down_ipidea_multi.py  Normal file
@@ -0,0 +1,211 @@
import os
import re
import time
import tqdm
import requests
import subprocess
import concurrent.futures
import sqlite3
from scidownl import scihub_download
import logging
import pymupdf


NUM_PROCESSES = 32  # number of concurrent processes
SCIHUB_URLS = [
    "https://sci-hub.st/",
    "https://sci-hub.se/",
    "https://sci-hub.ru/"
]
PROXY_SERVICE_URL = f"http://api.proxy.ipidea.io/getProxyIp?num={NUM_PROCESSES}&tag=static_balance&return_type=txt&lb=1&sb=0&flow=1&protocol=http"
SINGLE_PROXY_SERVICE_URL = f"http://api.proxy.ipidea.io/getProxyIp?num=1&tag=static_balance&return_type=txt&lb=1&sb=0&flow=1&protocol=http"
DOI_PATTERN = re.compile(r"DOI\s*=\s*\{(10\.\d{4,9}/[-._;()/:A-Z0-9]+)\}", re.IGNORECASE)

logging.basicConfig(level=logging.INFO, format='[%(levelname)s] | %(asctime)s | %(message)s')
logger = logging.getLogger(__name__)

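A quick demonstration of `DOI_PATTERN` as defined above: it captures the DOI inside a BibTeX `DOI = {...}` field, and `re.IGNORECASE` lets the upper-case character class match lower-case letters too. The sample entry is illustrative.

```python
import re

DOI_PATTERN = re.compile(r"DOI\s*=\s*\{(10\.\d{4,9}/[-._;()/:A-Z0-9]+)\}", re.IGNORECASE)

entry = "DOI = {10.1021/acs.chemmater.3c01234},"
dois = DOI_PATTERN.findall(entry)
```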
def get_directories(bib_dir_name, output_dirname):
    current_path = os.path.dirname(os.path.abspath(__file__))
    output_dir = os.path.join(current_path, output_dirname)
    bib_dir_path = os.path.join(current_path, bib_dir_name)
    db_path = os.path.join(current_path, "doi_status.db")
    return output_dir, bib_dir_path, db_path


def create_directory_if_not_exists(directory):
    os.makedirs(directory, exist_ok=True)


def fetch_proxies():
    proxies = []
    try:
        response = requests.get(PROXY_SERVICE_URL)
        if response.status_code == 200:
            proxy_list = response.text.strip().split('\r\n')
            for proxy in proxy_list:
                proxies.append({
                    "http": f"http://{proxy}",
                    "https": f"http://{proxy}",
                })
        if proxies:
            logger.info(f"Fetched proxies: {proxies}")
            return proxies
    except Exception as e:
        logger.error(f"Error fetching proxies: {e}")
    return None


def fetch_proxy():
    proxies = []
    try:
        response = requests.get(SINGLE_PROXY_SERVICE_URL)
        if response.status_code == 200:
            proxy_list = response.text.strip().split('\r\n')
            for proxy in proxy_list:
                proxies.append({
                    "http": f"http://{proxy}",
                    "https": f"http://{proxy}",
                })
        if proxies:
            logger.info(f"Fetched proxies: {proxies}")
            return proxies
    except Exception as e:
        logger.error(f"Error fetching proxies: {e}")
    return None


def read_dois_from_files(bib_dir_path):
    all_dois = []
    for bib_file_name in sorted(os.listdir(bib_dir_path)):
        if bib_file_name.endswith(".bib"):
            with open(os.path.join(bib_dir_path, bib_file_name), "r") as file:
                dois = DOI_PATTERN.findall(file.read())
                logger.info(f"{bib_file_name} has {len(dois)} doi(s)")
                all_dois.extend(dois)
    return list(set(all_dois))


def filter_downloaded_dois(all_dois, output_dir):
    for doi in os.listdir(output_dir):
        if doi.endswith(".pdf"):
            doi = doi.replace(".pdf", "").replace("_", "/")
            if doi in all_dois:
|
||||
all_dois.remove(doi)
|
||||
return all_dois
|
||||
|
||||
def read_dois_from_db(db_path, status):
|
||||
conn = sqlite3.connect(db_path)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(f"SELECT doi FROM doi_status WHERE status = '{status}'")
|
||||
dois = [row[0] for row in cursor.fetchall()]
|
||||
conn.close()
|
||||
return dois
|
||||
|
||||
def write_doi_to_db(db_path, doi, output_dirname, status):
|
||||
conn = sqlite3.connect(db_path)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("INSERT OR REPLACE INTO doi_status (doi, status, pdf_url) VALUES (?, ?, ?)", (doi, status, f"{output_dirname}/{doi.replace('/', '_')}.pdf"))
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
def initialize_db(db_path):
|
||||
conn = sqlite3.connect(db_path)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute('''
|
||||
CREATE TABLE IF NOT EXISTS doi_status (
|
||||
doi TEXT PRIMARY KEY,
|
||||
status TEXT,
|
||||
pdf_url TEXT
|
||||
)
|
||||
''')
|
||||
conn.commit()
|
||||
cursor.execute("PRAGMA journal_mode=WAL")
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
def download_doi(doi, output_dir, proxy, scihub_urls, db_path):
|
||||
success_dois, broken_dois, failed_dois, timeout_dois = [], [], [], []
|
||||
output_dirname = output_dir.split("/")[-1]
|
||||
for scihub_url in scihub_urls:
|
||||
output_path = os.path.join(output_dir, f"{doi.replace('/', '_')}.pdf")
|
||||
proxy_url = "https=" + proxy['https']
|
||||
|
||||
try:
|
||||
result = subprocess.run(
|
||||
['scidownl', 'download', '--doi', doi, '--out', output_path, '--scihub-url', scihub_url, '--proxy', proxy_url],
|
||||
check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
|
||||
)
|
||||
logger.info(result.stderr)
|
||||
|
||||
if "No pdf tag" in result.stderr:
|
||||
timeout_dois.append(doi)
|
||||
write_doi_to_db(db_path, doi, output_dirname, 'timeout')
|
||||
break
|
||||
elif "403" in result.stderr or "Unable to connect to proxy" in result.stderr or "504" in result.stderr or 'crawling_failed, error: HTTPSConnectionPool' in result.stderr:
|
||||
logger.warning("Proxy error detected, fetching new proxy.")
|
||||
proxy = fetch_proxy()[0]
|
||||
# time.sleep(2)
|
||||
continue
|
||||
elif result.stdout.strip() != '':
|
||||
try:
|
||||
# 尝试打开pdf文件
|
||||
with pymupdf.open(output_path) as pdf:
|
||||
logger.info(f"Downloaded {doi} successfully.")
|
||||
write_doi_to_db(db_path, doi, output_dirname, 'success')
|
||||
success_dois.append(doi)
|
||||
except:
|
||||
write_doi_to_db(db_path, doi, output_dirname, 'broken')
|
||||
logger.info(f"{doi}.pdf has been broken!")
|
||||
broken_dois.append(doi)
|
||||
break
|
||||
else:
|
||||
write_doi_to_db(db_path, doi, output_dirname, 'failed')
|
||||
break
|
||||
|
||||
except subprocess.CalledProcessError as e:
|
||||
logger.error(f"Error: {e}")
|
||||
failed_dois.append(doi)
|
||||
write_doi_to_db(db_path, doi, 'failed')
|
||||
continue
|
||||
|
||||
return success_dois, broken_dois, failed_dois, timeout_dois
|
||||
|
||||
def download_dois(all_dois, output_dir, db_path):
|
||||
success_dois, broken_dois, failed_dois, timeout_dois = [], [], [], []
|
||||
proxies = fetch_proxies()
|
||||
|
||||
with concurrent.futures.ProcessPoolExecutor(max_workers=NUM_PROCESSES) as executor:
|
||||
futures = []
|
||||
for i, doi in enumerate(all_dois):
|
||||
proxy = proxies[i % len(proxies)]
|
||||
futures.append(executor.submit(download_doi, doi, output_dir, proxy, SCIHUB_URLS, db_path))
|
||||
|
||||
for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc='Downloading DOIs', unit='doi'):
|
||||
result = future.result()
|
||||
if result:
|
||||
success, broken, failed, timeout = result
|
||||
success_dois.extend(success)
|
||||
broken_dois.extend(broken)
|
||||
failed_dois.extend(failed)
|
||||
timeout_dois.extend(timeout)
|
||||
|
||||
logger.info(f"Success: {len(success_dois)}, Broken: {len(broken_dois)}, Failed: {len(failed_dois)}, Timeout: {len(timeout_dois)}")
|
||||
|
||||
def main():
|
||||
bib_dir_name = "synthesis23-25"
|
||||
output_dirname = "synthesis23-25_pdfs"
|
||||
input('你确定是文件夹{}和{}吗?'.format(bib_dir_name, output_dirname))
|
||||
output_dir, bib_dir_path, db_path = get_directories(bib_dir_name, output_dirname)
|
||||
create_directory_if_not_exists(output_dir)
|
||||
|
||||
initialize_db(db_path)
|
||||
|
||||
all_dois = read_dois_from_files(bib_dir_path)
|
||||
logger.info(f"Total {len(all_dois)} doi(s)")
|
||||
|
||||
all_dois = filter_downloaded_dois(all_dois, output_dir)
|
||||
|
||||
all_dois = [doi for doi in all_dois if doi not in read_dois_from_db(db_path, 'success')]
|
||||
all_dois = [doi for doi in all_dois if doi not in read_dois_from_db(db_path, 'failed')]
|
||||
all_dois = [doi for doi in all_dois if doi not in read_dois_from_db(db_path, 'timeout')]
|
||||
|
||||
download_dois(all_dois, output_dir, db_path)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
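The whole BibTeX pass hinges on `DOI_PATTERN`, and `filter_downloaded_dois` relies on the `/` ↔ `_` filename round trip. A quick standalone check of that exact regex against a toy `.bib` entry:

```python
import re

# Same pattern as in stp2_down_ipidea_multi.py
DOI_PATTERN = re.compile(r"DOI\s*=\s*\{(10\.\d{4,9}/[-._;()/:A-Z0-9]+)\}", re.IGNORECASE)

bib_text = """
@article{sample,
  title = {A toy entry},
  doi = {10.1016/j.foodchem.2023.135936},
}
"""

dois = DOI_PATTERN.findall(bib_text)
print(dois)  # ['10.1016/j.foodchem.2023.135936']

# Round trip used by filter_downloaded_dois: DOI <-> PDF filename.
# Note: a DOI that itself contains '_' would not survive this round trip.
fname = dois[0].replace("/", "_") + ".pdf"
assert fname.replace(".pdf", "").replace("_", "/") == dois[0]
```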
18
config.py
Normal file
@@ -0,0 +1,18 @@
class ConfigFactory:
    def __init__(self):
        pass


class ReparagraphConfig(ConfigFactory):
    """Configuration class"""
    def __init__(self):
        self.remove_refs = False
        self.max_file_size = 10 * 1024 * 1024  # 10MB
        self.backup = True
        self.dry_run = False
        self.parallel = False
        self.task_name = "result_reparagraph"
        self.openai_base_url = None
        self.openai_api_key = None
        self.model_name = None
        self.max_retries = 5
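Each driver script re-declares a full `ReparagraphConfig` rather than inheriting the defaults above, so fields can silently go missing (one script omits `backup`). A sketch of an alternative, assuming a hypothetical `LocalReparagraphConfig` that keeps the defaults via `super().__init__()` and overrides only what changes:

```python
class ConfigFactory:
    def __init__(self):
        pass


class ReparagraphConfig(ConfigFactory):
    """Defaults, abbreviated from config.py"""
    def __init__(self):
        self.remove_refs = False
        self.backup = True
        self.max_retries = 5


class LocalReparagraphConfig(ReparagraphConfig):
    """Hypothetical per-script override"""
    def __init__(self):
        super().__init__()       # keep every default...
        self.remove_refs = True  # ...then override selectively


config = LocalReparagraphConfig()
print(config.remove_refs, config.backup, config.max_retries)  # True True 5
```

With this shape, adding a new default to the base class propagates to every script automatically.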
23
dify-sty-agent-reparagraph.py
Normal file
@@ -0,0 +1,23 @@
from clean.reparagraph import reparagraph_file
from config import ConfigFactory

OPENAI_BASE_URL = "http://8.218.238.241:17935/v1"
OPENAI_API_KEY = "sk-urFGAQRThR6pysea0aC93bD27fA34bA69811A9254aAaD8B2"
MODEL_NAME = "deepseek-chat"


class ReparagraphConfig(ConfigFactory):
    """Configuration class"""
    def __init__(self):
        self.remove_refs = True
        self.max_file_size = 50 * 1024 * 1024  # 50MB
        self.dry_run = False
        self.parallel = True
        self.task_name = "result_reparagraph"
        self.openai_base_url = OPENAI_BASE_URL
        self.openai_api_key = OPENAI_API_KEY
        self.model_name = MODEL_NAME
        self.max_retries = 5


config = ReparagraphConfig()
reparagraph_file("/home/ubuntu/50T/LYT/datapipe/result/dify-sty-agent", config)
# reparagraph_file("/home/ubuntu/50T/LYT/datapipe/result/Biocatalytic CsPbX3 Perovskite Nanocrystals A Self-Reporting Nanoprobe for Metabolism Analysis.md", config)
0
examples/__init__.py
Normal file
22
examples/dify-sty-agent-reparagraph.py
Normal file
@@ -0,0 +1,22 @@
from clean.reparagraph import *
from config import ConfigFactory

OPENAI_BASE_URL = "XXX"
OPENAI_API_KEY = "XXX"
MODEL_NAME = "deepseek-chat"


class ReparagraphConfig(ConfigFactory):
    """Configuration class"""
    def __init__(self):
        self.remove_content = True
        self.max_file_size = 50 * 1024 * 1024  # 50MB
        self.backup = True
        self.dry_run = False
        self.parallel = True
        self.task_name = "result_reparagraph"
        self.openai_base_url = OPENAI_BASE_URL
        self.openai_api_key = OPENAI_API_KEY
        self.model_name = MODEL_NAME


config = ReparagraphConfig()
reparagraph_file("/home/ubuntu/50T/LYT/datapipe/result/dify-sty-agent", config)
118
reparagraph.py
@@ -1,118 +0,0 @@
import re
import json
from openai import OpenAI


OPENAI_BASE_URL = "http://8.218.238.241:17935/v1"
OPENAI_API_KEY = "sk-urFGAQRThR6pysea0aC93bD27fA34bA69811A9254aAaD8B2"
MODEL_NAME = "deepseek-chat"


def get_true_level(title_info: list, max_retries: int = 5):
    source_title = json.dumps(title_info)
    instruction = """
    The following JSON contains heading information: each heading's text and line number. Fill in the correct hierarchy in the 'level' field, using the numbers 1, 2, 3, 4; the smaller the number, the higher the level.
    Additional note on the hierarchy: the document is expected to contain several level-1 headings, not just one.
    <PLACEHOLDER>
    When returning the result, strictly follow this example JSON format:
    { 'data': [
        { 'title': '# A hierarchically porous MOF confined CsPbBr3 quantum dots: Fluorescence switching probe for detecting Cu (II) and melamine in food samples', 'line_num': 1, 'level': 1},
        ...
    ] }
    """
    # Create the OpenAI client
    client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)
    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                stream=False,  # disable streaming
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": instruction.replace("<PLACEHOLDER>", source_title)}
                ],
                response_format={
                    'type': 'json_object'
                }
            )

            response = completion.choices[0].message.content
            response = json.loads(response)
            return response['data']

        except Exception as e:
            print(f"Attempt {attempt + 1}/{max_retries} failed: {str(e)}")
            if attempt == max_retries - 1:
                return "Error"


def extract_headings(file_path):
    """Extract every line starting with '#' from a markdown file, together with its line number."""
    headings = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line_num, line in enumerate(file, 1):
            if re.match(r'^#', line.strip()):
                headings.append((line_num, line.strip()))
    return headings


def extract_references(file_path, headings):
    """Extract the references section."""
    # Look for REFERENCE among the headings
    ref_heading = None
    for line_num, heading in headings:
        if "REFERENCE" in heading.upper():
            ref_heading = (line_num, heading)
            break

    if not ref_heading:
        return None

    ref_start = ref_heading[0] - 1  # convert to a 0-based index

    # Find the next heading or the end of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    ref_end = len(lines)
    for i in range(ref_start + 1, len(lines)):
        if re.match(r'^#', lines[i].strip()):
            ref_end = i
            break

    # Extract the references content
    references = lines[ref_start:ref_end]
    return ''.join(references)


def update_headings(file_path, heading_data):
    """Update the headings in a Markdown file according to the provided heading data."""
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Count the number of level == 1 entries in heading_data
    count_level_1 = sum(1 for item in heading_data if item['level'] == 1)
    flag = 3 if count_level_1 > 1 else 4  # 3 when there are multiple level-1 headings, otherwise 4

    for heading in heading_data:
        line_num = heading['line_num'] - 1
        if heading['level'] >= flag:
            lines[line_num] = "**" + lines[line_num].replace("#", "").strip() + "**\n"

    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(lines)


if __name__ == "__main__":
    file_path = "/root/data50T/LYT/matagent/A hierarchically porous MOF confined CsPbBr3 quantum dots- Fluorescence switching probe for detecting Cu (II) and melamine in food samples.md"

    # Extract and update the headings
    headings = extract_headings(file_path)
    title_info = [{"title": heading, "line_num": line_num, "level": "unknown"}
                  for line_num, heading in headings]
    # result = get_true_level(title_info)
    # update_headings(file_path, result)

    # Extract the references
    references = extract_references(file_path, headings)
    if references:
        print("Extracted references:")
        print(references)
    else:
        print("No references section found")
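The heading pass in reparagraph.py reduces to matching `^#` lines and demoting deep headings to bold text. A minimal sketch of that logic over an in-memory document (same regex, same `flag` rule as `update_headings`; the sample lines and levels are made up for illustration):

```python
import re

md_lines = [
    "# Title",
    "Some text",
    "## Section",
    "#### Deep heading",
]

# extract_headings: 1-based line numbers of lines starting with '#'
headings = [(n, line.strip()) for n, line in enumerate(md_lines, 1)
            if re.match(r'^#', line.strip())]
assert [n for n, _ in headings] == [1, 3, 4]

# update_headings: with a single level-1 heading, flag is 4, so only
# headings at level >= 4 are rewritten as bold text
heading_data = [
    {"line_num": 1, "level": 1},
    {"line_num": 3, "level": 2},
    {"line_num": 4, "level": 4},
]
count_level_1 = sum(1 for h in heading_data if h["level"] == 1)
flag = 3 if count_level_1 > 1 else 4
for h in heading_data:
    if h["level"] >= flag:
        i = h["line_num"] - 1
        md_lines[i] = "**" + md_lines[i].replace("#", "").strip() + "**"

print(md_lines[3])  # **Deep heading**
```

Levels 1 and 2 survive as markdown headings; only the deep heading is flattened, matching the file-based version's behavior.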