Ticket: Notice warning & memory exhaustion on large imports

Status: Resolved
Add-on / Version: DataGrab 4.2.0
Severity:
EE Version: 7.2.0

Niki Wolf

Nov 30, 2022

Sorry to come back with another problem. Same site, this time trying to run the big import on the staging server. On the old DG the import had to be split into two parts: the first part is 3000 entries, the second is the remainder, about 1200 or so entries. On the upgraded DG these are still split the same way.

Previously, with DG 4.0.5, it ran the 3000-entry update without overt issues that weren’t my own fault (the Assets images thing). And once I got the Assets thing squared away, re-running it was successful.

This time, on DG 4.2.0, it’s getting stuck at about 2866* entries and not completing. The logs don’t report any errors, but if I dig into the HTML of the CP, it does end up with one: it’s exhausting the memory.

Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 622592 bytes) in /srv/users/serverpilot/apps/client/system/ee/legacy/database/drivers/mysqli/mysqli_connection.php on line 112

*This is my best guess with rough math: the logs end at 8602 lines, there are roughly 3 lines per entry, the first two lines are the ‘start’ phrases, and the last two lines are the last entry it was trying before it gave up, so (8602 - 4) / 3 = 2866.

The second attempt with the same file got slightly further - at 2946 entries, but did not complete entry #2946 and halted.

When I ran the import of 3000 entries on 4.0.5, the logs ended at 9166 lines: 47 entries added, 2944 updated, and 9 flagged as duplicate url-titles, totaling 3000.

Additionally, in the ‘Status’ column of the table in the DG settings, it is showing a ‘NOTICE’ warning instead of the ‘completed’ badge.

Notice
Undefined variable: timestamp

user/addons/datagrab/models/datagrab_model.php, line 800

Severity: E_NOTICE

To be frank, this error seems to appear whether the import runs successfully or not (even on a two-entry test), so I’m not entirely sure if it’s even helpful, or if it’s contributing to why the import isn’t finishing.

I did try reverting the install back to 4.0.5 and re-running the import with the same CSV file that was getting stuck on 4.2.0, and while the Notice errors do not appear, the import still gets stuck with the memory exhaustion. We did include a .user.ini file to set the memory limit to -1 and the time limit to 100 minutes, but it does not seem to make any difference: the exhaustion errors were the same both before and after adding the ini file.
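For reference, the .user.ini was along these lines (roughly; exact contents may differ slightly):

memory_limit = -1
max_execution_time = 6000 ; 100 minutes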

Additional note—that same CSV file seemed to run fine on the production site which is still on EE5.3.0 & DG3.x.

Separately, Caroline is trying to run the CLI importer but is running into other issues. I don’t know whether it’s all related, but that may need to be a separate ticket if it isn’t.

Let me know if you want access to the staging server in this case since it is publicly accessible.

Thanks.

#1

BoldMinded (Brian)

You’re going to have to increase your memory settings. This isn’t something I can fix. I explained it a bit in the docs: https://docs.boldminded.com/datagrab/faqs#is-datagrab-4-is-slower-than-previous-versions

While this might not explicitly be a speed issue, it boils down to the same thing. The new version uses the EE models, so it uses more memory.

If you’re using the CLI or calling the ACT URL via cURL on the command line, then you’ll run into these issues. It’ll max out PHP’s memory because it’s trying to import thousands of entries in a single request. If you call the ACT URL directly in the browser with a low limit parameter, like 10 at a time, it’ll initiate a new request for each batch of 10 entries, and each request resets the memory limit. Executing the import in the CP interface does the same thing. If you need to import thousands of entries via CLI, in what is basically a single request, then your only option is to raise the memory limit as high as you can in PHP’s ini.
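For example (placeholder domain and ACT ID; yours will differ):

# a single cURL/CLI request: the entire import runs in one PHP process, which can max out memory
curl "https://example.com/?ACT=1"

# the same ACT URL in the browser with a low limit: each batch of 10 entries redirects and starts a fresh PHP process
https://example.com/?ACT=1&limit=10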

#2

BoldMinded (Brian)

The main issue here is that DataGrab wasn’t really designed to handle importing thousands of entries. Even if you jack up the memory to import 6000 entries in one go, if the import file grows to 7000 you might be right back in the same spot. A proper queuing system, or a pub/sub system like Amazon SQS or RabbitMQ, is what should be used for mass data imports. If you’re not familiar with how those work: they basically publish the raw data, or message, into a queue, and then the same system (or a separate one) subscribes to the queue, usually requesting one message at a time and processing it. Depending on the size of the queue and how the system is built, it could take minutes, hours, or days to get through it.

What we could do is set up a very basic “queue”. If you’re using cron to run the import CLI command, then instead of running a single command at, let’s say, 1am to import all 6000 entries, let’s change it to several different cron commands spread out over a period of time. For example:

# run at 1am every day
0 1 * * * php system/ee/eecli.php import:run --id=2 --limit=2000
# run at 2am every day
0 2 * * * php system/ee/eecli.php import:run --id=2 --limit=2000 --offset=2000
# run at 3am every day
0 3 * * * php system/ee/eecli.php import:run --id=2 --limit=2000 --offset=4000

With this it should import 2000 entries every hour, and if the import takes less than an hour there will be a short break in between. Basically it breaks the job into 3 different requests, each handling 2000 entries at a time.

The --offset parameter is new to the command, so you’ll need the build in the next comment for this to work. You might have to play with the limit, offset, and actual cron execution times depending on your situation.

#3

BoldMinded (Brian)

Comment has been marked private.

#4

Niki Wolf

We already did increase the memory settings - to the maximum the session allows. Both before and after this change, the memory exhaustion error was identical. It hits 1GB of memory and dies.

When you said “Executing the import in the CP interface does the same thing,” did you mean that, when executing the import in the CP, it should automatically be setting a limit per session? If so, I’m not sure it is actually doing that, because the memory exhaustion was first experienced while running it from the CP. We did not try cURL or the new CLI import until after.

I will try the patch provided and the limits & offsets to see if it does the trick.

Are the offset commands only available via the CLI? Or is it something that can be appended to the ACT URL also, like the limit?

Thanks!

#5

BoldMinded (Brian)

The offset parameter is named “skip” if you’re accessing it via the URL: ?ACT=1&skip=1000&limit=1000. Skip is a poorly chosen name that I didn’t choose :/

When you hit the ACT url in the browser directly it’s the same as when you import through the CP interface. The CP interface just calls the ACT url in an iframe. If you set the limit to 1 in the import config, or hit the url at ?ACT=1&limit=1, it should import 1 at a time, and it performs an http redirect to reload the page to import the next entry etc. When that http reload happens, it’s creating a whole new PHP process, so you shouldn’t be running into any memory issues. If you’re importing 1000 at a time, then all 1000 happen in the same http request, and thus the same php process, and it holds it all in memory and will eventually cap out.
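So, to work through the file manually in 1000-entry chunks, the sequence of ACT calls would look roughly like this (same placeholder ACT ID as above; your actual ID will differ):

?ACT=1&skip=0&limit=1000
?ACT=1&skip=1000&limit=1000
?ACT=1&skip=2000&limit=1000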

#6

Niki Wolf

OK. We have skip set to 0 and the import batch set to 3000 in the import settings. As I think I mentioned, these are all settings from the older DG, but on a cloned import, as the original did not work at all.

I just started a browser-tab run with the limit set to 1000; then I was going to do another 1000 and ‘skip’ the first 1k, and so on until it’s done.

As you may have seen in the EE Slack, we’re having some problems getting the EE CLI to run, so we may have to stick with old-style cron and set up the limits & skips chronologically until we can figure out what is keeping us from connecting to the DB via the EE CLI.

#7

Niki Wolf

We got the EE CLI issues sorted out with Matt’s help, so we’ve bypassed the SQL connection issue.

So the next issue falls back to that $timestamp error I reported:

eecli.php import:run --id=43 --pass_key=CBE4E1C824B3E0A74841FA51358AF4FF --limit=1500
Starting: Wh1 Digital Ocean Partial 1 [Clone]... 
The following error occurred: 
E_WARNING: Undefined variable $timestamp 
File: /user/addons/datagrab/models/datagrab_model.php
Line: 801

When I ran the ACT URL in a separate browser window, it also reported those $timestamp errors, nearly one for every entry. I accidentally closed the tab it was in before I could get a count, but it was a lot (I ran 1000 for that try).

Can you look into this?

#8

Niki Wolf

Comment has been marked private.

#9

BoldMinded (Brian)

Thanks for sharing that change. I made the update locally and it’ll be in the next release.

#10

BoldMinded (Brian)

So what is the current status? Are you able to import all the entries?

#11

Niki Wolf

I think so, but the logs seemed incomplete, and the one that ran with the offset setting had a batch of errors at the beginning but still seemed to roll on. I wanted to try another import today to see if it was repeatable, but I have a couple of appointments that may keep me from being able to do so this afternoon.

So as soon as I can run them again and determine that everything seemed to import, I will let you know what those results are.

#12

Niki Wolf

Comment has been marked private.

#13

Niki Wolf

Comment has been marked private.

#14

BoldMinded (Brian)

Can you share the full import file? I want to try to import it myself and see what happens. Put it on Dropbox or something where I can download it.

#15

Niki Wolf

I have uploaded it to the same location as previous upload shares, and deleted the old ones. Its file size should be about 2.6MB.

If you need the second one with fewer entries (to test a limit larger than the actual number of entries), let me know and I will add that one also.

#16

BoldMinded (Brian)

Comment has been marked private.

#17

BoldMinded (Brian)

Update: Sent Niki a new build on Slack.

I added a new parameter, --batch=no.

php system/ee/eecli.php import:run --id=26 --limit=500 --offset=0 --batch=no

will stop the import after 500 entries. Then

php system/ee/eecli.php import:run --id=26 --limit=500 --offset=500 --batch=no

will import the next 500. FWIW, it only imported 800+ entries for me when I tried 1000. I didn’t inspect the logs closely enough to find out why it didn’t import exactly 1000, so you might have to do that. I didn’t get any errors though.
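Spelled out for the full 3000-entry file in 500-entry batches, that would be something like the following (using the example import ID from above; substitute your actual --id and adjust the counts to your file):

php system/ee/eecli.php import:run --id=26 --limit=500 --offset=0 --batch=no
php system/ee/eecli.php import:run --id=26 --limit=500 --offset=500 --batch=no
php system/ee/eecli.php import:run --id=26 --limit=500 --offset=1000 --batch=no
php system/ee/eecli.php import:run --id=26 --limit=500 --offset=1500 --batch=no
php system/ee/eecli.php import:run --id=26 --limit=500 --offset=2000 --batch=no
php system/ee/eecli.php import:run --id=26 --limit=500 --offset=2500 --batch=no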

#18

Niki Wolf

Comment has been marked private.

#19

BoldMinded (Brian)

Can you provide a CP login to a dev environment so I can take a look? I really don’t like going down the path of logging into people’s sites, but it might be the quickest way to resolve this. I’d like to focus on fixing the URL title issue first, as it’s something I can’t replicate locally, then make a new ticket for the CartThrob-specific issues, because that sounds like something else entirely. I might need FTP access as well, because I don’t have CT set up locally and trying to replicate your environment would take a significant amount of time.

#20

BoldMinded (Brian)

If you can provide the CP and FTP credentials edit this ticket and use the provided fields.

#21

Niki Wolf

Comment has been marked private.

#22

BoldMinded (Brian)

Comment has been marked private.

#23

Niki Wolf

I tested this build several times on my local, and so far so good: the imports that were previously inserting the -1 when it was not necessary no longer do. I will send this to the V7 server and try it there also to make sure it’s solid on both.

#24

Niki Wolf

Hooray! I confirmed on the staging server that it, too, is no longer appending to the URL title when it does not need to.

I will next test the product imports with duplicate titles to verify that it still works where it is needed, and I’ll get a separate ticket opened for the CartThrob orders import issue as well.
